Description of problem:
While running the Ganesha HA cases, the VIPs got lost.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.2.0-12.el7rhgs.x86_64
glusterfs-3.7.5-16.el7rhgs.x86_64

How reproducible:
Not always

Steps to Reproduce:
1. Run the Ganesha HA automated cases.
2. Reboot a node; failover happens.
3. IO hangs; on the server side the VIPs are lost.

Restarting Pacemaker on all nodes brought the VIPs back and IO continued.

Actual results:

Expected results:

Additional info:
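For reference, a minimal sketch of the checks and the workaround used here, assuming a standard pcs-managed Pacemaker cluster (node names vm1..vm4 are illustrative, taken from the logs in the comments below):

    # Check whether the VIP resources are still running anywhere
    pcs status resources | grep cluster_ip

    # Workaround that recovered the VIPs in this run:
    # restart pacemaker on every node
    for node in vm1 vm2 vm3 vm4; do
        ssh $node systemctl restart pacemaker
    done

    # Confirm the VIPs are back
    crm_mon -1 | grep cluster_ip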
While debugging we did not see any errors logged that explain why the VIPs were lost. As Apeksha mentioned, restarting the pacemaker service brought the services back. Since this has been hit only once, keep this bug at low priority for now; it may be worth documenting it as a known issue.
Hit this issue again on the new build - glusterfs-3.7.5-17.el7rhgs.x86_64.
We saw the following logs when the issue happened:

Jan 21 22:26:57 vm4 crmd[28274]: notice: Operation vm1-cluster_ip-1_start_0: ok (node=vm4, call=60, rc=0, cib-update=37, confirmed=true)
Jan 21 22:26:57 vm4 crmd[28274]: notice: Operation vm4-cluster_ip-1_start_0: ok (node=vm4, call=61, rc=0, cib-update=38, confirmed=true)
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000 process (PID 30924) timed out
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000:30924 - timed out after 20000ms
Jan 21 22:27:17 vm4 crmd[28274]: error: Operation nfs-grace_monitor_5000: Timed Out (node=vm4, call=59, timeout=20000ms)
Jan 21 22:27:17 vm4 IPaddr(vm4-cluster_ip-1)[31319]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 IPaddr(vm1-cluster_ip-1)[31318]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 crmd[28274]: notice: Operation vm4-cluster_ip-1_stop_0: ok (node=vm4, call=67, rc=0, cib-update=42, confirmed=true)
Jan 21 22:27:17 vm4 crmd[28274]: notice: Operation vm1-cluster_ip-1_stop_0: ok (node=vm4, call=65, rc=0, cib-update=43, confirmed=true)

Maybe the cluster_ip resources went into stopped state on all the nodes, resulting in the loss of the VIPs. Not sure what could have triggered that.
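The logs show the nfs-grace monitor timing out twenty seconds before the cluster_ip resources are stopped, so the two may be related. A hedged sketch of how one might inspect this after the fact (standard pacemaker/pcs commands; resource names follow the ganesha-ha setup on this cluster):

    # Failed actions such as the nfs-grace_monitor timeout are listed at the end of:
    pcs status

    # Show all constraints with their ids, including the location constraints
    # that tie the cluster_ip resources to nodes
    pcs constraint --full

    # Clear the recorded failure so the resource can be rescheduled
    pcs resource cleanup nfs-grace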
To be clear, all the cluster_ip resources being in stopped state is what resulted in the VIPs getting lost. One possible reason is that the ganesha_active attribute got unset.
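If that theory is right, the attribute state should be queryable per node. A sketch, assuming the attribute is named as above (the HA scripts may spell it ganesha-active) and using illustrative node names:

    # attrd_updater ships with pacemaker; query the attribute on each node
    for node in vm1 vm2 vm3 vm4; do
        attrd_updater --query --name ganesha_active --node $node
    done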
This might have been fixed by the recent changes Kaleb posted. Kaleb, can you point us to a possible downstream patch? QE can then retest with a build that contains the fix.
Closing; this has not been seen during 3.1.3 testing.