Bug 1299858

Summary: While running ganesha HA cases IO hung, VIPs got lost
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Apeksha <akhakhar>
Component: gluster-nfs
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED NEXTRELEASE
QA Contact: Saurabh <saujain>
Severity: medium
Docs Contact:
Priority: low
Version: rhgs-3.1
CC: jthottan, kkeithle, ndevos, nlevinki, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: 3.1.3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-06 11:14:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Apeksha 2016-01-19 12:29:50 UTC
Description of problem:
While running Ganesha HA cases, the VIPs got lost.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.2.0-12.el7rhgs.x86_64
glusterfs-3.7.5-16.el7rhgs.x86_64

How reproducible:
Not always

Steps to Reproduce:
1. Run Ganesha HA automated cases
2. After a node reboot, failover happened
3. IO hung, and the VIPs were lost on the server

Restarting Pacemaker on all nodes brought the VIPs back and IO continued.
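
A minimal sketch of that workaround, assuming a systemd-managed Pacemaker on RHEL 7 with pcs available on the nodes (the VIP below is a placeholder):

# On every node of the ganesha HA cluster
systemctl restart pacemaker

# Verify the cluster_ip (VIP) resources report Started again
pcs status resources | grep cluster_ip

# Confirm the VIP address is plumbed on the expected node
ip addr show | grep <VIP>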


Actual results:


Expected results:


Additional info:

Comment 2 Soumya Koduri 2016-01-19 12:34:39 UTC
While debugging we haven't seen any errors logged explaining why the VIPs got lost. As Apeksha mentioned, restarting the pacemaker service brought the services back. Since this has been hit only once, we are keeping this bug at low priority for now; it may be good to document it as a known issue.

Comment 3 Apeksha 2016-01-21 10:22:30 UTC
Hit this issue again on the new build - glusterfs-3.7.5-17.el7rhgs.x86_64.

Comment 4 Soumya Koduri 2016-01-21 10:57:15 UTC
We have seen the logs below when the issue happened:

Jan 21 22:26:57 vm4 crmd[28274]:  notice: Operation vm1-cluster_ip-1_start_0: ok (node=vm4, call=60, rc=0, cib-update=37, confirmed=true)
Jan 21 22:26:57 vm4 crmd[28274]:  notice: Operation vm4-cluster_ip-1_start_0: ok (node=vm4, call=61, rc=0, cib-update=38, confirmed=true)
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000 process (PID 30924) timed out
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000:30924 - timed out after 20000ms
Jan 21 22:27:17 vm4 crmd[28274]:   error: Operation nfs-grace_monitor_5000: Timed Out (node=vm4, call=59, timeout=20000ms)
Jan 21 22:27:17 vm4 IPaddr(vm4-cluster_ip-1)[31319]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 IPaddr(vm1-cluster_ip-1)[31318]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 crmd[28274]:  notice: Operation vm4-cluster_ip-1_stop_0: ok (node=vm4, call=67, rc=0, cib-update=42, confirmed=true)
Jan 21 22:27:17 vm4 crmd[28274]:  notice: Operation vm1-cluster_ip-1_stop_0: ok (node=vm4, call=65, rc=0, cib-update=43, confirmed=true)


Maybe stopping the cluster_ip resources went into stopped state on all the nodes resulting in loss of VIPs. Not sure what could have triggered that.
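
The logs above show the nfs-grace monitor operation timing out (20000ms) right before the cluster_ip stop operations. A hedged sketch of how one might inspect this on an affected node, assuming the crm_mon/pcs tools of the RHEL 7-era Pacemaker stack (resource names taken from the logs above):

# One-shot cluster view including inactive resources and fail counts
crm_mon -1 -r -f

# Inspect the nfs-grace resource definition, including the monitor
# operation's interval and timeout (20s per the log above)
pcs resource show nfs-grace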

Comment 5 Soumya Koduri 2016-01-21 11:03:51 UTC
Sorry for the typo above. What I meant is that 'all the cluster_ip resources being in stopped state has resulted in the VIPs getting lost.'
One possible reason is that the ganesha_active attribute may have gotten unset.
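
A hedged sketch for checking that theory, assuming the attribute is a transient node attribute maintained through attrd (the attribute name follows the comment above and may differ from what the ganesha-ha scripts actually set):

# Query the ganesha_active attribute for a given node; an empty result
# would be consistent with the attribute having been unset
attrd_updater -Q -n ganesha_active -N <node>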

Comment 7 Niels de Vos 2016-03-07 11:02:40 UTC
This might have been fixed by the recent changes that Kaleb posted.

Kaleb, can you point us to a possible downstream patch? QE can then test again with a version that contains a fix.

Comment 8 Kaleb KEITHLEY 2016-06-06 11:14:55 UTC
Closing; this has not been seen during 3.1.3 testing.