Description of problem:
After nodes went down and were brought back into the cluster, the nodes left up have their services stuck in the recover state. This is very similar to bz 128432, which was closed because perl Net-Telnet was not installed; in this case it was installed and fencing had been working. The fence service on the shot nodes never gets back up to the run state. morph-01 and morph-06 were the nodes which were shot.

[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[4 2 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[4 2 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[4 2 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[4 2 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[4 2 5 3]

[root@morph-03 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 2 -
[2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 4 5 3]

[root@morph-04 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 4 5 3]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 4 5 3]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 4 5 3]

[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[2 3 4 5]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey0"                            3   4 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey1"                            5   6 recover 0 -
[2 3 4 5]

DLM Lock Space:  "corey2"                            7   8 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey0"                            4   5 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey1"                            6   7 recover 0 -
[2 3 4 5]

GFS Mount Group: "corey2"                            8   9 recover 0 -
[2 3 4 5]

[root@morph-06 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,6
[]

How reproducible:
Sometimes
Assigning this to dave because they're all stuck behind fencing.
morph-03 is the node where fenced is supposed to be doing the fencing (recover state 2), but for some reason it hasn't finished. You need to look at fenced on morph-03 to see what it's doing and where it (or the agent it called) is stuck. Nothing else can begin recovery until fencing is finished, of course.
Updated with the proper version and component name.
Reproduced this again this morning. It appears that this time everyone is stuck waiting for morph-05 to fence morph-02 (the node which was taken down and brought back up). So now morph-02 is stuck trying to join back in, and everyone else is stuck in recovery waiting for morph-05 to fence morph-02. However, there are no messages in the syslog on morph-05 about it attempting any fence operations on anyone.
Without trying to, I saw this or something similar myself a couple of days ago. I tried many things to reproduce it but couldn't. I added some new debugging data we can collect if it happens again.

- fenced now adds an extra line to syslog saying "fencing deferred to <nodeid>" when one of the other nodes is supposed to be doing the fencing. Look for this info next time.
- /proc/cluster/sm_debug has some additional info; it should verify where things are getting stuck.

Best, of course, would be to run with debugging output captured to a file ("fenced -D > fenced.log"), although this may not be practical if you can't reproduce it very reliably.
This could be related to bz143269, although in that bug _only_ the mount group is stuck in recovery.
Saw this again yesterday. fenced is still sitting in pause() waiting for a signal. SM has started the sg (service group), which means the signal should have been sent. Is the signal getting lost somewhere, is fenced missing it, or ...? I ran "killall -s SIGUSR1 fenced" to deliver the signal manually; fenced correctly woke up, found it needed to fence va04, and forked off fence_manual.
tadpol provided this link, which appears to explain this problem precisely:
http://www.gnu.org/software/libc/manual/html_node/Pause-Problems.html#Pause%20Problems
The fix should be pretty quick.
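For reference, the lost-wakeup window that page describes looks roughly like this. This is a minimal sketch assuming a SIGUSR1 handler and flag along the lines of what fenced does; the names are hypothetical, not the actual fenced source:

#include <signal.h>
#include <unistd.h>

volatile sig_atomic_t got_signal = 0;   /* hypothetical flag set by the handler */

static void usr1_handler(int sig)
{
    got_signal = 1;
}

/* Racy wait: if SIGUSR1 is delivered after the flag test but before
 * pause() is entered, the handler runs, the signal is consumed, and
 * pause() then sleeps forever waiting for a wakeup that has already
 * come and gone. */
static void wait_racy(void)
{
    signal(SIGUSR1, usr1_handler);
    while (!got_signal)
        pause();    /* lost-wakeup window is just before this call */
}

The flag test and the pause() call are two separate steps, so nothing stops the signal from landing in between, which matches fenced sitting in pause() even though SM had already started the sg.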
Should be fixed now, using this:
http://www.gnu.org/software/libc/manual/html_node/Sigsuspend.html#Sigsuspend
(It was never possible to reproduce this reliably, so verifying the fix will probably require just not seeing it for some time.)
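A minimal sketch of the sigsuspend() pattern from that page, again with hypothetical names rather than the actual fenced code: the signal is blocked before the flag test, and sigsuspend() unblocks it and sleeps in one atomic step, so delivery can no longer fall into the gap:

#include <signal.h>

volatile sig_atomic_t got_signal = 0;   /* set by the SIGUSR1 handler above */

static void wait_safe(void)
{
    sigset_t block, oldmask;

    /* Block SIGUSR1 so it can't be delivered between the flag
     * test and the wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &oldmask);

    /* sigsuspend() atomically installs the old mask (with SIGUSR1
     * unblocked) and sleeps; a signal that arrived while blocked is
     * delivered immediately instead of being lost. */
    while (!got_signal)
        sigsuspend(&oldmask);

    /* Restore the original signal mask. */
    sigprocmask(SIG_SETMASK, &oldmask, NULL);
}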
Could this bug and bz128059 be related?
Have not seen this issue of all services stuck in recovery lately.