Description of problem:
In one of the attempts of ganesha failback, the failback process failed.

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-3.el6rhs.x86_64
nfs-ganesha-2.2.0-3.el6rhs.x86_64

How reproducible:
Seen on one of the ganesha setups

Steps to Reproduce:
1. Set up the ganesha cluster
2. Kill the ganesha process on one of the nodes
3. Failover happens successfully
4. Now start the nfs-ganesha process on that node
5. Failback did not happen
(A rough command sketch of these steps is given at the end of this comment.)

[root@nfs2 ~]# pcs status
Cluster name: G1434073180.8
Last updated: Thu Jun 25 00:48:51 2015
Last change: Thu Jun 25 00:44:37 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
17 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     nfs-mon    (ocf::heartbeat:ganesha_mon):   FAILED nfs2
     Started: [ nfs1 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr):    Started nfs4
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):     Started nfs4
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr):    Started nfs1
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):     Started nfs1
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr):    Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):     Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr):    Started nfs4
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):     Started nfs4
 nfs2-dead_ip-1     (ocf::heartbeat:Dummy):     Started nfs2

Failed actions:
    nfs-mon_monitor_10000 on nfs2 'unknown error' (1): call=16, status=Timed Out, last-rc-change='Thu Jun 25 00:44:45 2015', queued=0ms, exec=0ms

Actual results:
Failback did not happen

Expected results:
Failback should happen successfully

Additional info:
/var/log/ganesha.log:

25/06/2015 00:44:30 : epoch 558b0196 : nfs2 : ganesha.nfsd-9489[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
25/06/2015 00:44:30 : epoch 558b0196 : nfs2 : ganesha.nfsd-9489[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
25/06/2015 00:45:30 : epoch 558b0196 : nfs2 : ganesha.nfsd-9489[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
25/06/2015 00:45:30 : epoch 558b0196 : nfs2 : ganesha.nfsd-9489[reaper] nfs4_clean_old_recov_dir :CLIENT ID :EVENT :Failed to open old v4 recovery dir (/var/lib/nfs/ganesha/v4old), errno=2
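Rough command sketch of the reproduction steps above. This is only a sketch; the exact service and process management commands are assumptions for this RHEL 6 build and should be adjusted to the local setup.

# 1. Confirm all VIPs are started before the test (run on any cluster node)
pcs status

# 2. Kill the ganesha process on the node under test (nfs2 here)
pkill -9 ganesha.nfsd

# 3. Confirm the VIP of nfs2 failed over to a surviving node
pcs status | grep cluster_ip

# 4./5. Start nfs-ganesha again on nfs2 and wait for failback
service nfs-ganesha start     # init-script name is an assumption
pcs status | grep cluster_ip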
Created attachment 1042958 [details] ganesha.log of the failed node
Sosreports of all 4 nodes: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1235536/
After a couple of minutes, pcs status shows the other nodes offline and all resources Stopped:

[root@nfs1 ~]# pcs status
Cluster name: G1434073180.8
Last updated: Thu Jun 25 02:01:48 2015
Last change: Wed Jun 24 06:17:01 2015
Stack: cman
Current DC: nfs1 - partition WITHOUT quorum
Version: 1.1.11-97629de
4 Nodes configured
17 Resources configured

Online: [ nfs1 ]
OFFLINE: [ nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Stopped: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr):    Stopped
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):     Stopped
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr):    Stopped
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):     Stopped
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr):    Stopped
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):     Stopped
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr):    Stopped
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):     Stopped
 nfs2-dead_ip-1     (ocf::heartbeat:Dummy):     Stopped
'ganesha-ha.script' on the setup was in a modified state, and a couple of steps in it had been commented out. Requested Apeksha to update the RPMs and recheck the issue.
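As a quick check for locally modified packaged files on the node, rpm verification can be used; the package names below are assumptions, adjust them to whatever actually ships the HA script on this setup:

rpm -V nfs-ganesha
rpm -V glusterfs-ganesha

rpm -V lists files that differ from the packaged versions, which would show whether the HA script was edited locally instead of coming from the installed RPMs.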
This has always been working for Saurabh and the developers. I request Apeksha to check this again and update the bug.
(In reply to Meghana from comment #7)
> This has always been working for Saurabh and the developers. I request
> Apeksha to check this again and update the bug.

Meghana,

In this case, after Apeksha updates the bug (on whether the issue is reproducible), it should be closed as CLOSED - WORKSFORME. It is not appropriate to move the bug to ON_QA. Moving a bug to ON_QA is valid only when there was an issue, that issue was fixed with a patch, and the patch was made available in a build (as mentioned in the FIXED-IN-VERSION field).

Hope that helps.
Thanks. I'll wait for Apeksha's updates and do that accordingly.
Did not hit this issue again, hence closing it.