Bug 133420

Summary: fenced doesn't wake up to do recovery
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: fence
Assignee: David Teigland <teigland>
Status: CLOSED CURRENTRELEASE
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium
Version: 4
CC: cluster-maint
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-01-31 23:37:16 UTC

Description Corey Marthaler 2004-09-23 21:00:29 UTC
Description of problem:
After some nodes went down and were then brought back into the
cluster, the nodes that remained up have their services stuck in the
recovery state.

This is very similar to bz 128432, which was closed because perl
Net-Telnet was not installed; in this case it is installed and fencing
had been working. The fence service on the shot nodes never gets back
up to the run state.

morph-01 and morph-06 were the nodes which were shot.


[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join     
S-1,80,6
[]


[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[4 2 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[4 2 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[4 2 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[4 2 5 3]


[root@morph-03 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 2 -
[2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 4 5 3]



[root@morph-04 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 4 5 3]


[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[2 3 4 5]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 3 4 5]



[root@morph-06 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join     
S-1,80,6
[]

How reproducible:
Sometimes

Comment 1 Christine Caulfield 2004-09-24 07:17:27 UTC
Assigning this to dave because they're all stuck behind fencing.

Comment 2 David Teigland 2004-09-24 07:38:04 UTC
morph-03 is the node where fenced is supposed to be doing fencing
(recover state 2) but for some reason hasn't finished.  You need to 
look at fenced on morph-03 to see what it's doing and where it (or
the agent it called) is stuck.

Nothing else can begin recovery until fencing is finished, of course.


Comment 3 Kiersten (Kerri) Anderson 2004-11-04 15:16:47 UTC
Updated with the proper version and component name.

Comment 4 Corey Marthaler 2004-11-10 16:27:28 UTC
Reproduced this again this morning.

It appears that this time everyone is stuck waiting for morph-05 to
fence morph-02 (the node which was taken down and brought back up).
morph-02 is now stuck trying to rejoin, and everyone else is stuck in
recovery waiting for morph-05 to fence it. However, there are no
messages in the syslog on morph-05 about it attempting any fence
operations on anyone.

Comment 5 David Teigland 2004-11-11 03:44:30 UTC
Without trying, I saw this or something similar myself a couple of
days ago.  I tried many things to reproduce it but couldn't.  I added
some new debugging data we can collect if it happens again.

- fenced now adds an extra line to syslog saying
  "fencing deferred to <nodeid>" when one of the other nodes is
   supposed to be doing the fencing.  Look for this info next time.

- /proc/cluster/sm_debug has some additional info -- it should verify
  where things are getting stuck.

Best, of course, would be to run with debugging output captured to a
file ("fenced -D > fenced.log"), although this may not be practical
if you can't reproduce it very reliably.


Comment 6 Corey Marthaler 2005-01-06 22:33:42 UTC
This could be related to bz143269, although in that bug _only_ the
mount group is stuck in recovery.

Comment 7 David Teigland 2005-01-07 02:47:39 UTC
Saw this again yesterday.  fenced is still sitting in pause() waiting
for a signal.  SM has started the sg (service group), which means the
signal should have been sent.  Is the signal getting lost somewhere,
is fenced missing it, or ... ?

I ran "killall -s SIGUSR1 fenced" to deliver the signal manually
and fenced correctly woke up, found it needed to fence va04 and
forked off fence_manual.

Comment 8 David Teigland 2005-01-07 15:11:38 UTC
tadpol provided this link, which appears to explain this problem
precisely:
http://www.gnu.org/software/libc/manual/html_node/Pause-Problems.html#Pause%20Problems

The fix should be pretty quick.
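
For reference, the race described on that page looks roughly like the
sketch below. This is illustrative only, not the actual fenced source;
work_pending and usr1_handler are invented names.

    /* Illustrative sketch of the pause() race (not fenced's real code).
     * If SIGUSR1 is delivered after the flag test but before pause() is
     * entered, the handler sets the flag, yet pause() still blocks
     * waiting for a signal that has already come and gone -- the lost
     * wakeup described above. */
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t work_pending = 0;

    static void usr1_handler(int sig)
    {
            work_pending = 1;
    }

    int main(void)
    {
            signal(SIGUSR1, usr1_handler);

            for (;;) {
                    if (!work_pending)
                            pause();   /* race window: the signal may land
                                          between the test and pause() */
                    work_pending = 0;
                    /* ... fencing/recovery work would happen here ... */
            }
    }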


Comment 9 David Teigland 2005-01-10 03:21:41 UTC
Should be fixed now, using this:
http://www.gnu.org/software/libc/manual/html_node/Sigsuspend.html#Sigsuspend

(It was never possible to reliably reproduce this, so verifying the
fix will probably require just not seeing it for some time.)
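
Roughly, the pattern from that page closes the window by keeping
SIGUSR1 blocked while the flag is tested and letting sigsuspend()
unblock-and-wait atomically. Again an illustrative sketch with
invented names, not the committed fenced change:

    /* Illustrative sketch of the sigsuspend() approach (not the actual
     * fenced change).  SIGUSR1 stays blocked while the flag is tested,
     * and sigsuspend() atomically restores the old mask and waits, so a
     * signal delivered at any point can no longer be lost. */
    #include <signal.h>
    #include <stddef.h>

    static volatile sig_atomic_t work_pending = 0;

    static void usr1_handler(int sig)
    {
            work_pending = 1;
    }

    int main(void)
    {
            struct sigaction sa;
            sigset_t block, oldmask;

            sa.sa_handler = usr1_handler;
            sa.sa_flags = 0;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGUSR1, &sa, NULL);

            /* keep SIGUSR1 blocked except while sleeping in sigsuspend() */
            sigemptyset(&block);
            sigaddset(&block, SIGUSR1);
            sigprocmask(SIG_BLOCK, &block, &oldmask);

            for (;;) {
                    while (!work_pending)
                            sigsuspend(&oldmask);  /* atomic unblock + wait */
                    work_pending = 0;
                    /* ... fencing/recovery work would happen here ... */
            }
    }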

Comment 10 Corey Marthaler 2005-01-10 22:09:08 UTC
Could this bug and bz128059 be related?  

Comment 11 Corey Marthaler 2005-01-31 23:37:16 UTC
Have not seen this issue of all services stuck in recovery lately.