Bug 133420
Summary: | fenced doesn't wake up to do recovery | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
Component: | fence | Assignee: | David Teigland <teigland> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | GFS Bugs <gfs-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | cluster-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-01-31 23:37:16 UTC | Type: | --- |
Description
Corey Marthaler
2004-09-23 21:00:29 UTC
Assigning this to dave because they're all stuck behind fencing.

morph-03 is the node where fenced is supposed to be doing fencing (recover state 2), but for some reason it hasn't finished. You need to look at fenced on morph-03 to see what it's doing and where it (or the agent it called) is stuck. Nothing else can begin recovery until fencing is finished, of course.

Updated with the proper version and component name.

Reproduced this again this morning. This time everyone appears to be stuck waiting for morph-05 to fence morph-02 (the node which was taken down and brought back up). So now morph-02 is stuck trying to join back in, and everyone else is stuck in recovery waiting for morph-05 to fence morph-02. However, there are no messages in the syslog on morph-05 about it attempting any fence operations on anyone.

Without trying, I saw this or something similar myself a couple of days ago. I tried many things to reproduce it but couldn't. I added some new debugging data we can collect if it happens again.
- fenced now adds an extra line to syslog saying "fencing deferred to <nodeid>" when one of the other nodes is supposed to be doing the fencing. Look for this info next time.
- /proc/cluster/sm_debug has some additional info, which should confirm where things are getting stuck.
Best, of course, would be to run with debugging output captured to a file, "fenced -D > fenced.log" (although this may not be practical if you can't reproduce it very reliably).

This could be related to bz143269, although in that bug _only_ the mount group is stuck in recovery.

Saw this again yesterday. fenced is still sitting in pause() waiting for a signal. SM has started the sg, which means the signal should have been sent. Is the signal getting lost somewhere, or is fenced missing it, or ... ? I ran "killall -s SIGUSR1 fenced" to deliver the signal manually, and fenced correctly woke up, found it needed to fence va04, and forked off fence_manual.
tadpol provided this link, which appears to explain this problem precisely: http://www.gnu.org/software/libc/manual/html_node/Pause-Problems.html#Pause%20Problems The fix should be pretty quick.

Should be fixed now, using this: http://www.gnu.org/software/libc/manual/html_node/Sigsuspend.html#Sigsuspend (It was never possible to reliably reproduce this, so verifying the fix will probably require just not seeing it for some time.)

Could this bug and bz128059 be related?

Have not seen this issue of all services stuck in recovery lately.
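
For readers following the two glibc links above, here is a minimal C sketch of the general pause() problem and the sigsuspend() pattern that avoids it. It is not fenced's actual code: the flag and function names (wakeup_pending, on_wakeup, install_wakeup_handler, wait_with_pause_racy, wait_with_sigsuspend) are hypothetical, and SIGUSR1 is used only because that is the signal delivered by hand in the report.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag set by the handler; not taken from fenced's source. */
static volatile sig_atomic_t wakeup_pending = 0;

static void on_wakeup(int signo)
{
    (void)signo;
    wakeup_pending = 1;
}

/* Install the SIGUSR1 handler. Returns 0 on success, -1 on error. */
int install_wakeup_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_wakeup;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Racy pattern: if SIGUSR1 arrives after the flag test but before
 * pause() is entered, the handler sets the flag, yet pause() still
 * sleeps until some later signal shows up. */
void wait_with_pause_racy(void)
{
    if (!wakeup_pending)
        pause();
    wakeup_pending = 0;
}

/* Safer pattern: block SIGUSR1 while testing the flag, then let
 * sigsuspend() unblock it and sleep as one atomic step. */
int wait_with_sigsuspend(void)
{
    sigset_t block, old, wait_mask;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* Mask used while sleeping: the caller's mask with SIGUSR1 allowed. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);

    while (!wakeup_pending)
        sigsuspend(&wait_mask);   /* returns after the handler has run */

    wakeup_pending = 0;
    return sigprocmask(SIG_SETMASK, &old, NULL);
}
```

A daemon built this way would call install_wakeup_handler() once at startup and then call wait_with_sigsuspend() wherever it previously called pause(). Because the signal stays blocked between the flag test and the sleep, a wake-up sent in that window is left pending and delivered inside sigsuspend() rather than lost.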