Description of problem:
After nodes went down and were brought back into the cluster, the nodes left up have their services stuck in the recover state. This is very similar to bz 128432, which was closed because perl Net-Telnet was not installed; in this case it was installed and fencing had been working. The fence service on the shot nodes never gets back up to the run state. morph-01 and morph-06 were the nodes which were shot.

[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[4 2 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[4 2 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[4 2 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[4 2 5 3]

[root@morph-03 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 2 -
[2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 4 5 3]

[root@morph-04 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 4 5 3]

[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[2 3 4 5]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 3 4 5]

[root@morph-06 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

How reproducible:
Sometimes
Assigning this to dave because they're all stuck behind fencing.
morph-03 is the node where fenced is supposed to be doing the fencing (recover state 2), but for some reason it hasn't finished. You need to look at fenced on morph-03 to see what it's doing and where it (or the agent it called) is stuck. Nothing else can begin recovery until fencing is finished, of course.
Updated with the proper version and component name.
Reproduced this again this morning. It appears that this time everyone is stuck waiting for morph-05 to fence morph-02 (the node which was taken down and brought back up). So now morph-02 is stuck trying to join back in, and everyone else is stuck in recovery waiting for morph-05 to fence morph-02. However, there are no messages in the syslog on morph-05 about it attempting any fence operations on anyone.
Without trying to, I saw this or something similar myself a couple of days ago. I tried many things to reproduce it but couldn't. I added some new debugging data we can collect if it happens again.

- fenced now adds an extra line to syslog saying "fencing deferred to <nodeid>" when one of the other nodes is supposed to be doing the fencing. Look for this info next time.
- /proc/cluster/sm_debug has some additional info; it should verify where things are getting stuck.

Best, of course, would be to run with debugging output captured to a file ("fenced -D > fenced.log"), although this may not be practical if you can't reproduce it very reliably.
This could be related to bz143269, although in that bug _only_ the mount group is stuck in recovery.
Saw this again yesterday. fenced is still sitting in pause() waiting for a signal. SM has started the sg (service group), which means the signal should have been sent. Is the signal getting lost somewhere, is fenced missing it, or ...? I ran "killall -s SIGUSR1 fenced" to deliver the signal manually; fenced correctly woke up, found it needed to fence va04, and forked off fence_manual.
tadpol provided this link, which appears to explain this problem precisely:
http://www.gnu.org/software/libc/manual/html_node/Pause-Problems.html#Pause%20Problems
The fix should be pretty quick.
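For reference, the lost-wakeup window that page describes looks roughly like this. This is a minimal sketch assuming a SIGUSR1 handler and flag along the lines of what fenced does; the names are hypothetical, not the actual fenced source:

#include <signal.h>
#include <unistd.h>

volatile sig_atomic_t got_signal = 0;   /* hypothetical flag set by the handler */

static void usr1_handler(int sig)
{
    got_signal = 1;
}

/* Racy wait: if SIGUSR1 is delivered after the flag test but before
 * pause() is entered, the handler runs, the signal is consumed, and
 * pause() then sleeps forever waiting for a wakeup that has already
 * come and gone. */
static void wait_racy(void)
{
    signal(SIGUSR1, usr1_handler);
    while (!got_signal)
        pause();    /* lost-wakeup window is just before this call */
}

The flag test and the pause() call are two separate steps, so nothing stops the signal from landing in between, which matches fenced sitting in pause() even though SM had already started the sg.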
Should be fixed now, using this:
http://www.gnu.org/software/libc/manual/html_node/Sigsuspend.html#Sigsuspend
(It was never possible to reproduce this reliably, so verifying the fix will probably require just not seeing it for some time.)
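A minimal sketch of the sigsuspend() pattern from that page, again with hypothetical names rather than the actual fenced code: the signal is blocked before the flag test, and sigsuspend() unblocks it and sleeps in one atomic step, so delivery can no longer fall into the gap:

#include <signal.h>

volatile sig_atomic_t got_signal = 0;   /* set by the SIGUSR1 handler above */

static void wait_safe(void)
{
    sigset_t block, oldmask;

    /* Block SIGUSR1 so it can't be delivered between the flag
     * test and the wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &oldmask);

    /* sigsuspend() atomically installs the old mask (with SIGUSR1
     * unblocked) and sleeps; a signal that arrived while blocked is
     * delivered immediately instead of being lost. */
    while (!got_signal)
        sigsuspend(&oldmask);

    /* Restore the original signal mask. */
    sigprocmask(SIG_SETMASK, &oldmask, NULL);
}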
Could this bug and bz128059 be related?
Have not seen this issue of all services stuck in recovery lately.