Bug 210541
Summary: | Cluster node hangs in vgscan at reboot time | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Robert Peterson <rpeterso> |
Component: | cman | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 5.0 | CC: | cluster-maint, teigland |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-04-17 19:42:30 UTC | Type: | --- |
Description
Robert Peterson
2006-10-12 19:26:55 UTC
The only "odd" message there is the -32 (EPIPE) one which implies that a node has gone down - oh and you forgot to attach the file :)

You might like to try this patch, but I don't think it will make much difference to be honest. Logs from the various daemons are, I think, needed here.

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 867f93d..82f2ac0 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -519,6 +519,7 @@ static int receive_from_sock(void)
 	msg.msg_flags = 0;
 	msg.msg_control = incmsg;
 	msg.msg_controllen = sizeof(incmsg);
+	msg.msg_iovlen = 1;

 	/* I don't see why this circular buffer stuff is necessary for SCTP
 	 * which is a packet-based protocol, but the whole thing breaks under

It might also be useful to get the "cman_tool status" and "group_tool" outputs.

I wonder if they look anything like this:

[root@bench-12 cluster]# ./cman/cman_tool/cman_tool nodes
Node  Sts   Inc   Joined               Name
  12   M   8036   2006-10-16 05:43:45  bench-12.lab.msp.redhat.com
  13   X   8040                        bench-13.lab.msp.redhat.com
  14   M   8660   2006-10-16 05:44:25  bench-14.lab.msp.redhat.com
  15   X   8048                        bench-15.lab.msp.redhat.com
  16   X   8060                        bench-16.lab.msp.redhat.com
  17   X   8060                        bench-17.lab.msp.redhat.com
  18   X   8060                        bench-18.lab.msp.redhat.com
  19   X   8044                        bench-19.lab.msp.redhat.com

[root@bench-12 cluster]# ./group/tool/group_tool
type   level  name     id        state
fence  0      default  0001000e  none
[12 13 14 15 16 17 18]
dlm    1      clvmd    0002000e  none
[12 13 14 15 16 17 18]

I can see DLM attempting to contact node bench-13 (which fails of course) with -32 (EPIPE) errors attached to them.

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.53; previous revision: 1.52
done

Should fix the odd status above. See how you get on with that.

Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.

I haven't seen this problem for ages; closing CURRENT_RELEASE.
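
For anyone wanting to compare their own membership listing against the "cman_tool nodes" output quoted above, here is a rough sketch of a parser that lists the nodes not shown as members. The sample rows are copied from this report. The function names (parse_nodes, non_members) are hypothetical, and the reading of "M" as a current member and anything else (such as "X") as not a member is an assumption about the Sts column, not something stated in this bug.

```python
import re

# Sample rows copied from the "cman_tool nodes" output quoted in this report.
SAMPLE = """\
Node  Sts   Inc   Joined               Name
  12   M   8036   2006-10-16 05:43:45  bench-12.lab.msp.redhat.com
  13   X   8040                        bench-13.lab.msp.redhat.com
  14   M   8660   2006-10-16 05:44:25  bench-14.lab.msp.redhat.com
  15   X   8048                        bench-15.lab.msp.redhat.com
  16   X   8060                        bench-16.lab.msp.redhat.com
  17   X   8060                        bench-17.lab.msp.redhat.com
  18   X   8060                        bench-18.lab.msp.redhat.com
  19   X   8044                        bench-19.lab.msp.redhat.com
"""

# One data row: node id, status flag, incarnation, optional join time, name.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S)\s+(\d+)\s+"
    r"(?:(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+)?"
    r"(\S+)\s*$"
)


def parse_nodes(text):
    """Return one dict per data row; the header and blank lines are skipped."""
    nodes = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        node_id, status, inc, joined, name = match.groups()
        nodes.append({
            "id": int(node_id),
            "status": status,
            "inc": int(inc),
            "joined": joined,  # None when the Joined column is empty
            "name": name,
        })
    return nodes


def non_members(nodes):
    """Node ids whose status is not "M" (assumed to mean current member)."""
    return [node["id"] for node in nodes if node["status"] != "M"]


if __name__ == "__main__":
    print(non_members(parse_nodes(SAMPLE)))
```

The ids this returns could then be checked against the bracketed member lists in the group_tool output, where the fence and dlm groups still list nodes 13 and 15 to 18.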