Description of problem:
While running the Ganesha HA cases, the VIPs got lost.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.2.0-12.el7rhgs.x86_64
glusterfs-3.7.5-16.el7rhgs.x86_64

How reproducible:
Not always

Steps to Reproduce:
1. Run the Ganesha HA automated cases.
2. Reboot a node; failover happens.
3. IO hangs; on the server side the VIPs are lost.

Restarting Pacemaker on all nodes brought the VIPs back and IO continued.

Actual results:

Expected results:

Additional info:
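For reference, a minimal sketch of the checks and the workaround used here, assuming a standard pcs-managed Pacemaker cluster (node names vm1..vm4 are illustrative, taken from the logs in the comments below):

    # Check whether the VIP resources are still running anywhere
    pcs status resources | grep cluster_ip

    # Workaround that recovered the VIPs in this run:
    # restart pacemaker on every node
    for node in vm1 vm2 vm3 vm4; do
        ssh $node systemctl restart pacemaker
    done

    # Confirm the VIPs are back
    crm_mon -1 | grep cluster_ip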
While debugging we did not see any errors logged that explain why the VIPs were lost. As Apeksha mentioned, restarting the pacemaker service brought the services back. Since this has been hit only once, keep this bug at low priority for now; it may be worth documenting it as a known issue.
Hit this issue again on the new build - glusterfs-3.7.5-17.el7rhgs.x86_64.
We saw the following logs when the issue happened:

Jan 21 22:26:57 vm4 crmd[28274]: notice: Operation vm1-cluster_ip-1_start_0: ok (node=vm4, call=60, rc=0, cib-update=37, confirmed=true)
Jan 21 22:26:57 vm4 crmd[28274]: notice: Operation vm4-cluster_ip-1_start_0: ok (node=vm4, call=61, rc=0, cib-update=38, confirmed=true)
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000 process (PID 30924) timed out
Jan 21 22:27:17 vm4 lrmd[28271]: warning: nfs-grace_monitor_5000:30924 - timed out after 20000ms
Jan 21 22:27:17 vm4 crmd[28274]: error: Operation nfs-grace_monitor_5000: Timed Out (node=vm4, call=59, timeout=20000ms)
Jan 21 22:27:17 vm4 IPaddr(vm4-cluster_ip-1)[31319]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 IPaddr(vm1-cluster_ip-1)[31318]: INFO: IP status = ok, IP_CIP=
Jan 21 22:27:17 vm4 crmd[28274]: notice: Operation vm4-cluster_ip-1_stop_0: ok (node=vm4, call=67, rc=0, cib-update=42, confirmed=true)
Jan 21 22:27:17 vm4 crmd[28274]: notice: Operation vm1-cluster_ip-1_stop_0: ok (node=vm4, call=65, rc=0, cib-update=43, confirmed=true)

Maybe the cluster_ip resources went into stopped state on all the nodes, resulting in the loss of the VIPs. Not sure what could have triggered that.
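The logs show the nfs-grace monitor timing out twenty seconds before the cluster_ip resources are stopped, so the two may be related. A hedged sketch of how one might inspect this after the fact (standard pacemaker/pcs commands; resource names follow the ganesha-ha setup on this cluster):

    # Failed actions such as the nfs-grace_monitor timeout are listed at the end of:
    pcs status

    # Show all constraints with their ids, including the location constraints
    # that tie the cluster_ip resources to nodes
    pcs constraint --full

    # Clear the recorded failure so the resource can be rescheduled
    pcs resource cleanup nfs-grace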
To be clear, all the cluster_ip resources being in stopped state is what resulted in the VIPs getting lost. One possible reason is that the ganesha_active attribute got unset.
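If that theory is right, the attribute state should be queryable per node. A sketch, assuming the attribute is named as above (the HA scripts may spell it ganesha-active) and using illustrative node names:

    # attrd_updater ships with pacemaker; query the attribute on each node
    for node in vm1 vm2 vm3 vm4; do
        attrd_updater --query --name ganesha_active --node $node
    done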
This might have been fixed by the recent changes Kaleb posted. Kaleb, can you point us to a possible downstream patch? QE can then retest with a build that contains the fix.
Closing; this has not been seen during 3.1.3 testing.