Bug 2365590
| Summary: | [NFS-Ganesha][HAProxy-Protocol] VIP failback to original node causes I/O failures with "Remote I/O error" on clients | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Manisha Saini <msaini> | |
| Component: | Cephadm | Assignee: | Shweta Bhosale <shbhosal> | |
| Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini> | |
| Severity: | high | Docs Contact: | Rivka Pollack <rpollack> | |
| Priority: | high | |||
| Version: | 8.1 | CC: | akane, cephqe-warriors, gouthamr, kkeithle, rpollack, sabose, spunadik, tserlin | |
| Target Milestone: | --- | |||
| Target Release: | 8.1z1 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | ceph-19.2.1-232.el9cp | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2370197 | Environment: | ||
| Last Closed: | 2025-08-18 14:00:39 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2370197 | |||
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 8.1 security and bug fix updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:14015
Description of problem:
========================
An NFS-Ganesha setup with HAProxy and VIP-based access is configured in active-passive mode. Two NFS exports are created and mounted via the VIP on two Linux clients. When the active Ganesha node (holding the VIP) is rebooted, the VIP correctly fails over to the standby node, and I/O continues as expected after the grace period. However, once the original node (Node1) comes back online and reclaims the VIP, I/O operations on the clients hang temporarily and eventually fail with "Remote I/O error".

Steps performed:
================
1. Reboot the active node, Node1 (cali016), which holds the VIP.
   Observation: the VIP is reassigned to the standby node, Node2 (cali020), and I/O continues on the mount points after Ganesha comes out of its grace period.
2. When Node1 comes back online, the VIP (10.8.130.51/22) moves back to Node1. I/O hangs on the mount points for some time and later fails with "Remote I/O error".

Client-side errors during the untar:
```
tar: Unexpected EOF in archive
tar: linux-6.4/arch/powerpc/configs: Cannot utime: Remote I/O error
tar: linux-6.4/arch/powerpc/configs: Cannot change ownership to uid 0, gid 0: Remote I/O error
tar: linux-6.4/arch/powerpc/configs: Cannot change mode to rwxrwxr-x: Remote I/O error
tar: linux-6.4/arch/powerpc: Cannot utime: Remote I/O error
tar: linux-6.4/arch/powerpc: Cannot change ownership to uid 0, gid 0: Remote I/O error
tar: linux-6.4/arch/powerpc: Cannot change mode to rwxrwxr-x: Remote I/O error
tar: linux-6.4/arch/arm64/boot/dts/arm: Cannot stat: Remote I/O error
```

`ls` on the mount point:
```
[root@argo022 ganesha]# ls
ls: cannot open directory '.': Remote I/O error
```

NFS-related daemons after the failback:
```
[ceph: root@cali013 /]# ceph orch ps | grep nfs
haproxy.nfs.nfsganesha.cali016.uebnvn     cali016  *:2049,9049  running (21m)  54s ago  35m  42.3M  -  2.4.22-f8e3218  6c223bddea69  925fe8677810
keepalived.nfs.nfsganesha.cali016.gqrqtf  cali016               running (21m)  54s ago  35m  1551k  -  2.2.8           09859a486cb9  2536958c3e0f
nfs.nfsganesha.0.2.cali020.gqxkvj         cali020  *:12049      running (23m)  55s ago  23m  263M   -  6.5             c0866a09d082  131f65723caf
```

Version-Release number of selected component (if applicable):
==============================================================
```
# ceph --version
ceph version 19.2.1-188.el9cp (834ac46f780fbdc2ac4ba4851a36db6df3c1aa6f) squid (stable)

# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9_5.1.x86_64
nfs-utils-2.5.4-27.el9_5.1.x86_64
nfs-ganesha-selinux-6.5-13.el9cp.noarch
nfs-ganesha-6.5-13.el9cp.x86_64
nfs-ganesha-ceph-6.5-13.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-13.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-13.el9cp.x86_64
nfs-ganesha-rgw-6.5-13.el9cp.x86_64
```

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Configure NFS-Ganesha with the HAProxy protocol and VIP-based failover (a sketch of such a setup follows this list).
2. Create two NFS exports.
3. Mount both exports on two clients using the VIP (e.g. 10.8.130.51).
4. On both clients, run Linux kernel untars.
5. Reboot the active Ganesha node (Node1), which holds the VIP.
6. Observe that the VIP moves to the standby node (Node2) and that I/O continues after the grace period.
7. Allow Node1 to come back online.
8. Observe that the VIP moves back to Node1.
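A minimal sketch of steps 1-4, assuming a CephFS-backed export (the report does not state whether the exports were CephFS- or RGW-backed). The cluster name, hosts, and VIP are taken from this report; the flag spellings follow the upstream `ceph nfs` documentation, and the export pseudo-paths, mount point, and tarball location are placeholders rather than values captured from this environment.

```sh
# Sketch only: export pseudo-paths, mount point, and tarball path are placeholders.

# NFS cluster on cali016/cali020 with an ingress service (haproxy + keepalived)
# that speaks the PROXY protocol to the ganesha backend.
ceph nfs cluster create nfsganesha "1 cali016 cali020" \
    --ingress --virtual_ip 10.8.130.51/22 --ingress-mode haproxy-protocol

# Two exports (CephFS-backed here purely for illustration).
ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export1 --fsname cephfs
ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export2 --fsname cephfs

# On each client: mount one export via the VIP and run the kernel untar workload.
mount -t nfs -o vers=4.1 10.8.130.51:/export1 /mnt/ganesha
cd /mnt/ganesha && tar xf /root/linux-6.4.tar.xz
```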
Actual results:
===============
Upon Node1's return, the VIP moves back to it. I/O on the clients temporarily hangs and then fails with "Remote I/O error".

Expected results:
=================
VIP failback to Node1 should occur without any hangs or errors.

Additional info:
================
```
# ceph nfs cluster info nfsganesha
{
    "nfsganesha": {
        "backend": [
            {
                "hostname": "cali020",
                "ip": "10.8.130.20",
                "port": 12049
            }
        ],
        "ingress_mode": "haproxy-protocol",
        "monitor_port": 9049,
        "port": 2049,
        "virtual_ip": "10.8.130.51"
    }
}

# ceph orch ls | grep nfs
ingress.nfs.nfsganesha  10.8.130.51:2049,9049  2/2  9m ago  44m  cali016;cali020;count:1
nfs.nfsganesha          ?:12049                1/1  9m ago  44m  cali016;cali020;count:1
```

ganesha.log on cali020:
```
May 12 05:27:42 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:42 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_117] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe21400c870 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:44 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:44 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_124] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe210005820 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:46 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:46 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_68] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe208005890 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:48 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:48 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_117] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe21400c870 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:50 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:50 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_117] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe210005820 fd 86 proxy header rest len failed header rlen = % (will set dead)
```
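These repeated handle_haproxy_header events suggest that the ganesha instance on cali020 cannot parse the PROXY protocol header on the incoming connections and marks those transports dead, which matches the client-side "Remote I/O error". A hedged sketch of how the two halves of that handshake could be cross-checked on this cluster; the haproxy.cfg location is an assumption based on the usual cephadm daemon layout and is not quoted anywhere in this report.

```sh
# On the node currently holding the VIP, check whether the generated haproxy backend
# sends a PROXY protocol header (send-proxy / send-proxy-v2) to the ganesha server.
# The haproxy.cfg path below is an assumption based on the standard cephadm layout.
grep -n 'send-proxy' \
    /var/lib/ceph/49a987b4-2b40-11f0-aadb-b49691cee574/haproxy.nfs.nfsganesha.cali016.uebnvn/haproxy/haproxy.cfg

# On the ganesha node, confirm which source addresses ganesha trusts for PROXY headers.
# Path inferred by analogy with the cali016 path shown in the configs below.
grep -n 'HAProxy_Hosts' \
    /var/lib/ceph/49a987b4-2b40-11f0-aadb-b49691cee574/nfs.nfsganesha.0.2.cali020.gqxkvj/etc/ganesha/ganesha.conf
```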
Cali016 -- ganesha.conf (before reboot):
```
[root@cali016 coredump]# cat /var/lib/ceph/49a987b4-2b40-11f0-aadb-b49691cee574/nfs.nfsganesha.0.0.cali016.lmoeqa/etc/ganesha/ganesha.conf
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3, 4;
        mount_path_pseudo = true;
        Enable_UDP = false;
        NFS_Port = 12049;
        allow_set_io_flusher_fail = true;
        HAProxy_Hosts = 10.8.130.13, 2620:52:0:880:b696:91ff:fece:e574, 10.8.130.15, 2620:52:0:880:b696:91ff:fecd:e0dc, 10.8.130.16, 2620:52:0:880:b696:91ff:fece:e6b8, 10.8.130.20, 2620:52:0:880:b696:91ff:fecd:e0a0, 10.8.130.19, 2620:52:0:880:b696:91ff:fece:e7d4, 10.8.130.51;
        Monitoring_Port = 9587;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = "rados_cluster";
        Minor_Versions = 1, 2;
        Server_Scope = "49a987b4-2b40-11f0-aadb-b49691cee574-nfsganesha";
        IdmapConf = "/etc/ganesha/idmap.conf";
        Virtual_Server = false;
}

RADOS_KV {
        UserId = "nfs.nfsganesha.0.0.cali016.lmoeqa";
        nodeid = 0;
        pool = ".nfs";
        namespace = "nfsganesha";
}

RADOS_URLS {
        UserId = "nfs.nfsganesha.0.0.cali016.lmoeqa";
        watch_url = "rados://.nfs/nfsganesha/conf-nfs.nfsganesha";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfsganesha.0.0.cali016.lmoeqa-rgw";
}

%url    rados://.nfs/nfsganesha/conf-nfs.nfsganesha
```

Cali020 -- ganesha.conf (post failover):
```
# cat ganesha.conf
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3, 4;
        mount_path_pseudo = true;
        Enable_UDP = false;
        NFS_Port = 12049;
        allow_set_io_flusher_fail = true;
        HAProxy_Hosts = 10.8.130.13, 2620:52:0:880:b696:91ff:fece:e574, 10.8.130.15, 2620:52:0:880:b696:91ff:fecd:e0dc, 10.8.130.16, 2620:52:0:880:b696:91ff:fece:e6b8, 10.8.130.20, 2620:52:0:880:b696:91ff:fecd:e0a0, 10.8.130.19, 2620:52:0:880:b696:91ff:fece:e7d4, 10.8.130.51;
        Monitoring_Port = 9587;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = "rados_cluster";
        Minor_Versions = 1, 2;
        Server_Scope = "49a987b4-2b40-11f0-aadb-b49691cee574-nfsganesha";
        IdmapConf = "/etc/ganesha/idmap.conf";
        Virtual_Server = false;
}

RADOS_KV {
        UserId = "nfs.nfsganesha.0.2.cali020.gqxkvj";
        nodeid = 0;
        pool = ".nfs";
        namespace = "nfsganesha";
}

RADOS_URLS {
        UserId = "nfs.nfsganesha.0.2.cali020.gqxkvj";
        watch_url = "rados://.nfs/nfsganesha/conf-nfs.nfsganesha";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfsganesha.0.2.cali020.gqxkvj-rgw";
}
```
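Both daemon instances point at the same recovery backend (RecoveryBackend = "rados_cluster", pool ".nfs", namespace "nfsganesha") and keep nodeid = 0, so the grace/recovery database is shared across the failover and failback. For reference, a hedged sketch of dumping that grace database with the ganesha-rados-grace utility from the nfs-ganesha-rados-grace package listed above; run it from a node whose Ceph keyring can read the .nfs pool (the exact userid/keyring to use is not given in this report).

```sh
# Hedged sketch: dump the shared grace database referenced by both ganesha.conf files
# (RADOS_KV pool ".nfs", namespace "nfsganesha"). Requires a keyring that can read
# the .nfs pool; pass --userid / --cephconf as needed for your environment.
ganesha-rados-grace --pool .nfs --ns nfsganesha dump
```

The dump output should show the current and recovery epochs plus the per-node need-grace/enforcing flags, i.e. whether nodeid 0 was still expected to perform recovery around the time of the failback.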