Description of problem:
Failover is not working with the latest glusterfs and ganesha builds.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-4 and nfs-ganesha-2.3.1-4

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node cluster and set up ganesha on it.
2. Mount the volume on the client and start some IO.
3. Kill the nfs-ganesha service on the node the volume is mounted from.
4. Observe that the IP does not fail over and pcs status shows the same state as before.
5. IO on the mount point hangs, and the following blocked dd traces are seen in /var/log/messages on the client:

May 13 04:35:08 dhcp37-206 kernel: nfs: server 10.70.40.205 not responding, still trying
May 13 04:38:28 dhcp37-206 kernel: INFO: task dd:28575 blocked for more than 120 seconds.
May 13 04:38:28 dhcp37-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 13 04:38:28 dhcp37-206 kernel: dd D ffff8800d404bd50 0 28575 28540 0x00000080
May 13 04:38:28 dhcp37-206 kernel: ffff8800d404bbf0 0000000000000082 ffff880210b5a280 ffff8800d404bfd8
May 13 04:38:28 dhcp37-206 kernel: ffff8800d404bfd8 ffff8800d404bfd8 ffff880210b5a280 ffff88021fd94780
May 13 04:38:28 dhcp37-206 kernel: 0000000000000000 7fffffffffffffff ffffffff811688b0 ffff8800d404bd50
May 13 04:38:28 dhcp37-206 kernel: Call Trace:
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688b0>] ? wait_on_page_read+0x60/0x60
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8163a909>] schedule+0x29/0x70
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81058aaf>] ? kvm_clock_get_cycles+0x1f/0x30
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688b0>] ? wait_on_page_read+0x60/0x60
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81639f3e>] io_schedule_timeout+0xae/0x130
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81639fd8>] io_schedule+0x18/0x20
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688be>] sleep_on_page+0xe/0x20
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81638780>] __wait_on_bit+0x60/0x90
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81168646>] wait_on_page_bit+0x86/0xb0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81168781>] filemap_fdatawait_range+0x111/0x1b0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8116a7af>] filemap_write_and_wait_range+0x3f/0x70
May 13 04:38:28 dhcp37-206 kernel: [<ffffffffa05331ef>] nfs4_file_fsync+0x5f/0xa0 [nfsv4]
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8120f77b>] vfs_fsync+0x2b/0x40
May 13 04:38:28 dhcp37-206 kernel: [<ffffffffa04b5f0a>] nfs_file_flush+0x7a/0xb0 [nfs]
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811dc254>] filp_close+0x34/0x80
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811fcb98>] __close_fd+0x78/0xa0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811dd963>] SyS_close+0x23/0x50
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
May 13 04:40:28 dhcp37-206 kernel: INFO: task dd:28575 blocked for more than 120 seconds.
May 13 04:40:28 dhcp37-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
6. I can see the below messages in /var/log/messages on the server side:

May 13 20:59:26 dhcp42-20 ganesha_mon(nfs-mon)[22159]: INFO: warning: crm_attribute --node=dhcp42-20 --lifetime=forever --name=grace-active --update=1 failed
May 13 20:59:36 dhcp42-20 ganesha_mon(nfs-mon)[22261]: INFO: warning: crm_attribute --node=dhcp42-20 --lifetime=forever --name=grace-active --update=1 failed

Can this be the reason the failover is not working?

7. Once I restart the ganesha service on that node, IO starts again.

Actual results:
Failover is not working with the latest builds.

Expected results:
Failover should work properly.

Additional info:
This was observed while running the automated suite on the latest builds. The same tests were working fine with the 3.7.9-1 build.
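For reference, a minimal sketch of the commands used to trigger and observe the failure. It assumes the client mounts through the 10.70.40.205 VIP seen in the client log; the volume name (testvol), mount point (/mnt/nfs) and node name are placeholders, not taken from the test bed:

# Client: mount the volume over NFSv4 via the VIP and start IO
mount -t nfs -o vers=4 10.70.40.205:/testvol /mnt/nfs
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 &

# Server that owns the VIP: kill the ganesha daemon
pkill -9 ganesha.nfsd

# Any cluster node: the virtual IP resource should move to a surviving
# node, but in this bug pcs still shows the pre-failure state
pcs status

# The node attribute that ganesha_mon reports it cannot update can be
# queried directly with crm_attribute
crm_attribute --node=dhcp42-20 --name=grace-active --query

On a healthy setup the virtual IP resource is expected to start on a surviving node shortly after the kill; here it stays where it was, the grace-active updates fail, and the client dd blocks in fsync as shown in the trace above.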
Verified this bug with the latest glusterfs-3.7.9-5 and nfs-ganesha-2.3.1-7 builds, and the issue is no longer reproducible. If the ganesha service goes down on the node from which IO is running, it fails over to another node and IO is not affected in any way. Also ran the HA automated cases on the mentioned builds and they are working fine. Based on the above observations, marking this bug as Verified.
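For anyone re-running this check by hand rather than through the automated HA suite, a rough outline of the verification; node names and exact resource names will differ per setup:

# Note which node currently owns the virtual IP for the mount
pcs status

# On that node, stop the ganesha service to simulate the failure
systemctl stop nfs-ganesha

# From another node: the virtual IP resource should now be reported as
# Started on a surviving node, and the client-side dd should keep
# making progress instead of hanging
pcs status
ip addr show   # on the takeover node, the VIP should be listed here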
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240