Description of problem:
Failover is not working with the latest glusterfs and ganesha builds.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-4 and nfs-ganesha-2.3.1-4

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node cluster and set up ganesha on it.
2. Mount the volume on the client and start some IO.
3. Kill the nfs-ganesha service on the node the volume is mounted from.
4. Observe that the IP does not fail over and pcs status shows the same state as before.
5. IO on the mount point hangs, and the following blocked dd traces are seen in /var/log/messages on the client:

May 13 04:35:08 dhcp37-206 kernel: nfs: server 10.70.40.205 not responding, still trying
May 13 04:38:28 dhcp37-206 kernel: INFO: task dd:28575 blocked for more than 120 seconds.
May 13 04:38:28 dhcp37-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 13 04:38:28 dhcp37-206 kernel: dd D ffff8800d404bd50 0 28575 28540 0x00000080
May 13 04:38:28 dhcp37-206 kernel: ffff8800d404bbf0 0000000000000082 ffff880210b5a280 ffff8800d404bfd8
May 13 04:38:28 dhcp37-206 kernel: ffff8800d404bfd8 ffff8800d404bfd8 ffff880210b5a280 ffff88021fd94780
May 13 04:38:28 dhcp37-206 kernel: 0000000000000000 7fffffffffffffff ffffffff811688b0 ffff8800d404bd50
May 13 04:38:28 dhcp37-206 kernel: Call Trace:
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688b0>] ? wait_on_page_read+0x60/0x60
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8163a909>] schedule+0x29/0x70
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81058aaf>] ? kvm_clock_get_cycles+0x1f/0x30
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688b0>] ? wait_on_page_read+0x60/0x60
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81639f3e>] io_schedule_timeout+0xae/0x130
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81639fd8>] io_schedule+0x18/0x20
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688be>] sleep_on_page+0xe/0x20
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81638780>] __wait_on_bit+0x60/0x90
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81168646>] wait_on_page_bit+0x86/0xb0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81168781>] filemap_fdatawait_range+0x111/0x1b0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8116a7af>] filemap_write_and_wait_range+0x3f/0x70
May 13 04:38:28 dhcp37-206 kernel: [<ffffffffa05331ef>] nfs4_file_fsync+0x5f/0xa0 [nfsv4]
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8120f77b>] vfs_fsync+0x2b/0x40
May 13 04:38:28 dhcp37-206 kernel: [<ffffffffa04b5f0a>] nfs_file_flush+0x7a/0xb0 [nfs]
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811dc254>] filp_close+0x34/0x80
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811fcb98>] __close_fd+0x78/0xa0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811dd963>] SyS_close+0x23/0x50
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
May 13 04:40:28 dhcp37-206 kernel: INFO: task dd:28575 blocked for more than 120 seconds.
May 13 04:40:28 dhcp37-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
6. I can see the below messages in /var/log/messages on the server side:

May 13 20:59:26 dhcp42-20 ganesha_mon(nfs-mon)[22159]: INFO: warning: crm_attribute --node=dhcp42-20 --lifetime=forever --name=grace-active --update=1 failed
May 13 20:59:36 dhcp42-20 ganesha_mon(nfs-mon)[22261]: INFO: warning: crm_attribute --node=dhcp42-20 --lifetime=forever --name=grace-active --update=1 failed

Can this be the reason the failover is not working?

7. Once I restart the ganesha service on that node, IO starts again.

Actual results:
Failover is not working with the latest builds.

Expected results:
Failover should work properly.

Additional info:
This was observed while running the automated suite on the latest builds. The same tests were working fine with the 3.7.9-1 build.
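For reference, a minimal sketch of the commands used to trigger and observe the failure. It assumes the client mounts through the 10.70.40.205 VIP seen in the client log; the volume name (testvol), mount point (/mnt/nfs) and node name are placeholders, not taken from the test bed:

# Client: mount the volume over NFSv4 via the VIP and start IO
mount -t nfs -o vers=4 10.70.40.205:/testvol /mnt/nfs
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 &

# Server that owns the VIP: kill the ganesha daemon
pkill -9 ganesha.nfsd

# Any cluster node: the virtual IP resource should move to a surviving
# node, but in this bug pcs still shows the pre-failure state
pcs status

# The node attribute that ganesha_mon reports it cannot update can be
# queried directly with crm_attribute
crm_attribute --node=dhcp42-20 --name=grace-active --query

On a healthy setup the virtual IP resource is expected to start on a surviving node shortly after the kill; here it stays where it was, the grace-active updates fail, and the client dd blocks in fsync as shown in the trace above.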
Verified this bug with the latest glusterfs-3.7.9-5 and nfs-ganesha-2.3.1-7 builds, and the issue is no longer reproducible. If the ganesha service goes down on the node from which IO is running, it fails over to another node and IO is not affected in any way. Also ran the HA automated cases on the mentioned builds and they are working fine. Based on the above observations, marking this bug as Verified.
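For anyone re-running this check by hand rather than through the automated HA suite, a rough outline of the verification; node names and exact resource names will differ per setup:

# Note which node currently owns the virtual IP for the mount
pcs status

# On that node, stop the ganesha service to simulate the failure
systemctl stop nfs-ganesha

# From another node: the virtual IP resource should now be reported as
# Started on a surviving node, and the client-side dd should keep
# making progress instead of hanging
pcs status
ip addr show   # on the takeover node, the VIP should be listed here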
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240