Bug 1299858 - While running ganesha ha cases IO hanged, VIPs got lost
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-nfs
Hardware: x86_64 Linux
Priority: low  Severity: medium
Assigned To: Kaleb KEITHLEY
Keywords: ZStream
Depends On:
Reported: 2016-01-19 07:29 EST by Apeksha
Modified: 2016-08-19 05:15 EDT (History)
CC: 7 users

See Also:
Fixed In Version: 3.1.3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2016-06-06 07:14:55 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Apeksha 2016-01-19 07:29:50 EST
Description of problem:
While running the Ganesha HA cases, IO hung and the VIPs were lost.

Version-Release number of selected component (if applicable):

How reproducible:
Not always

Steps to Reproduce:
1. Run Ganesha HA automated cases
2. After a node reboot, failover occurred
3. IO hung, and the VIPs were lost on the server

Restarting Pacemaker on all nodes brought the VIPs back, and IO continued.
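The workaround described above can be sketched as follows. This is a recovery sketch, not a fix; the node names (vm1..vm4) are placeholders taken from the hostnames in the logs, and the interface/VIP checks assume standard iproute2 and pcs tooling.

```shell
# Restart pacemaker on every node of the ganesha HA cluster
# (vm1..vm4 are placeholder node names; substitute the real ones).
for node in vm1 vm2 vm3 vm4; do
    ssh "$node" systemctl restart pacemaker
done

# Verify that the cluster_ip (VIP) resources came back to Started state
pcs status resources

# Confirm a VIP is actually plumbed on the node currently hosting it
ip addr show
```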

Actual results:

Expected results:

Additional info:
Comment 2 Soumya Koduri 2016-01-19 07:34:39 EST
While debugging, we did not see any errors logged explaining why the VIPs were lost. As Apeksha mentioned, restarting the pacemaker service brought the services back. Since this has been hit only once, we are keeping the bug at low priority for now; it may be good to document it as a known issue.
Comment 3 Apeksha 2016-01-21 05:22:30 EST
Hit this issue again on the new build - glusterfs-3.7.5-17.el7rhgs.x86_64.
Comment 4 Soumya Koduri 2016-01-21 05:57:15 EST
We have seen the below logs when the issue happened:

Jan 21 22:26:57 vm4 crmd[28274]:  notice: Operation vm1-cluster_ip-1_start_0: ok (node=vm4, call=60, rc=0, cib-update=37, confirmed=true)
Jan 21 22:26:57 vm4 crmd[28274]:  notice: Operation vm4-cluster_ip-1_start_0: ok (node=vm4, call=61, rc=0, cib-update=38, confirmed=true)
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000 process (PID 30924) timed out
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000:30924 - timed out after 20000ms
Jan 21 22:27:17 vm4 crmd[28274]:   error: Operation nfs-grace_monitor_5000: Timed Out (node=vm4, call=59, timeout=20000ms)
Jan 21 22:27:17 vm4 IPaddr(vm4-cluster_ip-1)[31319]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 IPaddr(vm1-cluster_ip-1)[31318]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 crmd[28274]:  notice: Operation vm4-cluster_ip-1_stop_0: ok (node=vm4, call=67, rc=0, cib-update=42, confirmed=true)
Jan 21 22:27:17 vm4 crmd[28274]:  notice: Operation vm1-cluster_ip-1_stop_0: ok (node=vm4, call=65, rc=0, cib-update=43, confirmed=true)

Maybe stopping the cluster_ip resources went into stopped state on all the nodes resulting in loss of VIPs. Not sure what could have triggered that.
Comment 5 Soumya Koduri 2016-01-21 06:03:51 EST
Sorry for the typo above. What I meant is that all the cluster_ip resources being in the stopped state resulted in the VIPs getting lost.
One of the reasons could be that the ganesha_active attribute may have gotten unset.
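If this recurs, the hypothesis above could be checked while the VIPs are down by querying the node attribute directly. A minimal sketch, assuming placeholder node names vm1..vm4 and the attribute name ganesha_active as written in this comment (the exact spelling may differ between ganesha-ha.sh versions, so adjust accordingly):

```shell
# Query the suspected attribute on each node; an empty/missing value on all
# nodes would support the theory that it got unset.
# (Attribute and node names are assumptions, not confirmed by this bug.)
for node in vm1 vm2 vm3 vm4; do
    attrd_updater --query --name ganesha_active --node "$node"
done

# One-shot cluster view: -A shows node attributes, -r shows inactive
# (stopped) resources, -f shows resource fail counts.
crm_mon -1Arf
```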
Comment 7 Niels de Vos 2016-03-07 06:02:40 EST
Might have been fixed with the recent changes that Kaleb posted.

Kaleb, can you point us to a possible downstream patch? QE can then test again with a version that contains a fix.
Comment 8 Kaleb KEITHLEY 2016-06-06 07:14:55 EDT
Closing; this has not been seen during 3.1.3 testing.
