Bug 1335826

Summary: failover is not working with latest builds.
Product: Red Hat Gluster Storage
Reporter: Shashank Raj <sraj>
Component: gluster-nfs
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED ERRATA
QA Contact: Shashank Raj <sraj>
Severity: urgent
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, asrivast, jthottan, kkeithle, ndevos, rhinduja, rhs-bugs, sashinde, skoduri, storage-qa-internal
Keywords: Regression, ZStream
Target Release: RHGS 3.1.3
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.7.9-5
Doc Type: No Doc Update
Cloned To: 1336197
Last Closed: 2016-06-23 05:23:17 UTC
Type: Bug
Bug Blocks: 1311817, 1336197, 1336198, 1336199

Description Shashank Raj 2016-05-13 10:13:35 UTC
Description of problem:
Failover is not working with the latest glusterfs and nfs-ganesha builds.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-4 and nfs-ganesha-2.3.1-4

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node cluster and set up ganesha on it.
2. Mount the volume on the client and start some IO.
3. Kill the nfs-ganesha service on the node where the volume is mounted.
4. Observe that the virtual IP does not fail over and pcs status shows the same state as before.
5. IO on the mount point hangs, and the following blocked dd traces are seen in /var/log/messages on the client:

May 13 04:35:08 dhcp37-206 kernel: nfs: server 10.70.40.205 not responding, still trying
May 13 04:38:28 dhcp37-206 kernel: INFO: task dd:28575 blocked for more than 120 seconds.
May 13 04:38:28 dhcp37-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 13 04:38:28 dhcp37-206 kernel: dd              D ffff8800d404bd50     0 28575  28540 0x00000080
May 13 04:38:28 dhcp37-206 kernel: ffff8800d404bbf0 0000000000000082 ffff880210b5a280 ffff8800d404bfd8
May 13 04:38:28 dhcp37-206 kernel: ffff8800d404bfd8 ffff8800d404bfd8 ffff880210b5a280 ffff88021fd94780
May 13 04:38:28 dhcp37-206 kernel: 0000000000000000 7fffffffffffffff ffffffff811688b0 ffff8800d404bd50
May 13 04:38:28 dhcp37-206 kernel: Call Trace:
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688b0>] ? wait_on_page_read+0x60/0x60
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8163a909>] schedule+0x29/0x70
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81058aaf>] ? kvm_clock_get_cycles+0x1f/0x30
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688b0>] ? wait_on_page_read+0x60/0x60
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81639f3e>] io_schedule_timeout+0xae/0x130
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81639fd8>] io_schedule+0x18/0x20
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811688be>] sleep_on_page+0xe/0x20
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81638780>] __wait_on_bit+0x60/0x90
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81168646>] wait_on_page_bit+0x86/0xb0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81168781>] filemap_fdatawait_range+0x111/0x1b0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8116a7af>] filemap_write_and_wait_range+0x3f/0x70
May 13 04:38:28 dhcp37-206 kernel: [<ffffffffa05331ef>] nfs4_file_fsync+0x5f/0xa0 [nfsv4]
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff8120f77b>] vfs_fsync+0x2b/0x40
May 13 04:38:28 dhcp37-206 kernel: [<ffffffffa04b5f0a>] nfs_file_flush+0x7a/0xb0 [nfs]
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811dc254>] filp_close+0x34/0x80
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811fcb98>] __close_fd+0x78/0xa0
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff811dd963>] SyS_close+0x23/0x50
May 13 04:38:28 dhcp37-206 kernel: [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
May 13 04:40:28 dhcp37-206 kernel: INFO: task dd:28575 blocked for more than 120 seconds.
May 13 04:40:28 dhcp37-206 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

6. The following messages are seen in /var/log/messages on the server side:

May 13 20:59:26 dhcp42-20 ganesha_mon(nfs-mon)[22159]: INFO: warning: crm_attribute --node=dhcp42-20 --lifetime=forever --name=grace-active --update=1 failed

May 13 20:59:36 dhcp42-20 ganesha_mon(nfs-mon)[22261]: INFO: warning: crm_attribute --node=dhcp42-20 --lifetime=forever --name=grace-active --update=1 failed

Could this be the reason the failover does not work?
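For reference, the failing attribute update can be retried manually on a cluster node; grace-active is the node attribute that the ganesha_mon resource agent maintains. A sketch (the node name is taken from the log above; run as root on a cluster node, so this will only work inside the pacemaker cluster):

```shell
# Query the current value of the grace-active attribute that
# ganesha_mon maintains for this node.
crm_attribute --node=dhcp42-20 --lifetime=forever \
              --name=grace-active --query

# Retry the exact update the resource agent attempted; a non-zero
# exit status here reproduces the "failed" warnings logged above.
crm_attribute --node=dhcp42-20 --lifetime=forever \
              --name=grace-active --update=1
echo "crm_attribute exit status: $?"
```

If the update keeps failing, ganesha_mon cannot record that the node is out of grace, which would explain why the cluster never triggers the IP failover.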

7. Once the ganesha service is restarted on that node, IO resumes.
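The reproduction steps above can be sketched as follows. This is a minimal sketch, assuming a volume named testvol, brick paths, node hostnames, and a client mount point that are not in the original report; the virtual IP is the one from the client log:

```shell
# 1. On one node of the 4-node cluster: create and export a volume
#    (ganesha HA is assumed to be configured already).
gluster volume create testvol replica 2 node{1..4}:/bricks/brick1/testvol
gluster volume start testvol
gluster nfs-ganesha enable                    # bring up the ganesha HA cluster
gluster volume set testvol ganesha.enable on  # export the volume via NFS-Ganesha

# 2. On the client: mount through the virtual IP and start IO.
mount -t nfs -o vers=4 10.70.40.205:/testvol /mnt/nfs
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=4096 &

# 3. On the node serving the mount: kill the nfs-ganesha service.
pkill -9 ganesha.nfsd          # or: systemctl stop nfs-ganesha

# 4. On any cluster node: the virtual IP should move to a surviving node.
#    With the affected builds, resource placement stays unchanged (the bug)
#    and the client dd hangs.
pcs status
```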

Actual results:

Failover is not working with the latest builds.

Expected results:

Failover should work properly.

Additional info:

This was observed while running the automated test suite on the latest builds. The same tests worked fine with the 3.7.9-1 build.

Comment 6 Shashank Raj 2016-05-18 12:22:56 UTC
Verified this bug with the latest glusterfs-3.7.9-5 and nfs-ganesha-2.3.1-7 builds; the issue is no longer reproducible.

When the ganesha service goes down on the node from which IO is running, the virtual IP fails over to another node and the IO is not affected in any way.

Also ran the HA automated cases on the mentioned builds, and they pass.

Based on the above observation, marking this bug as Verified.

Comment 8 errata-xmlrpc 2016-06-23 05:23:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240