Description of problem:

Similar symptom to bug 210359, but the patch for 210359 is applied and all
those ugly messages are gone. As best I can tell, this is a case of an SCTP
message not getting through.

In this case, all nodes are functioning properly except for "kool", which
is node id 4. Kool is stuck in "vgscan".

cat /sys/kernel/dlm/clvmd/recover_status on kool said "1", which means
DLM_RS_NODES. Based on the console messages, it seems as if kool never got
out of dlm_recover_members.

ps ax -o pid,cmd,wchan on kool revealed:

 2009 [dlm_recoverd]          dlm_wait_function
 2022 /usr/sbin/vgscan        unix_stream_recvmsg

group_tool -v on all the other nodes gave:

type             level name     id       state node id local_done
fence            0     default  00010002 none
[1 2 3 4 5]
dlm              1     clvmd    00010005 none
[1 2 3 4 5]
dlm              1     soot     00030002 none
[1 2 3 5]
gfs              2     soot     00020002 none
[1 2 3 5]

Furthermore, the last console messages on kool were:

Starting cluster:
   Loading modules...
DLM (built Sep 25 2006 13:21:54) installed
GFS2 (built Sep 25 2006 13:22:27) installed
 done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]
Starting system message bus: [  OK  ]
Starting clvmd: Module sctp cannot be unloaded due to unsafe usage in
net/sctp/protocol.c:1189
dlm: clvmd: recover 1
dlm: clvmd: add member 3
dlm: clvmd: add member 1
dlm: clvmd: add member 5
dlm: clvmd: add member 2
dlm: clvmd: add member 4
dlm: Initiating association with node 1
dlm: got new/restarted association 1 nodeid 1
dlm: Initiating association with node 2
[  OK  ]

But it never continued like the rest of the nodes, which also had:

dlm: clvmd: total members 5 error 0
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 2 entries

There were also these strange messages on nodeid 2:

dlm: clvmd: recover 7
dlm: clvmd: remove member 1
dlm: soot: recover 7
dlm: soot: remove member 1
GFS: fsid=smoke:soot.1: jid=2: Trying to acquire journal lock...
dlm: clvmd: dlm_wait_function aborted
dlm: clvmd: ping_members aborted -4 last nodeid 4
dlm: clvmd: total members 4 error -4
dlm: clvmd: recover_members failed -4
dlm: clvmd: recover 7 error -4
dlm: clvmd: recover 9
dlm: clvmd: add member 1
dlm: Initiating association with node 1
dlm: clvmd: reject old reply 5 got 16 wanted 19
dlm: soot: total members 4 error 0
dlm: soot: dlm_recover_directory
dlm: clvmd: total members 5 error 0
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 3 entries
dlm: soot: dlm_recover_directory 13814 entries
dlm: soot: dlm_purge_locks
dlm: soot: dlm_recover_masters
dlm: soot: dlm_recover_masters 8 resources
dlm: soot: dlm_recover_locks
dlm: soot: dlm_recover_locks 0 locks
dlm: clvmd: recover 9 done: 524 ms
dlm: soot: dlm_recover_rsbs
dlm: soot: dlm_recover_rsbs 9 rsbs
dlm: soot: recover 7 done: 12984 ms
GFS: fsid=smoke:soot.1: jid=2: Looking at journal...
GFS: fsid=smoke:soot.1: jid=2: Done
dlm: soot: recover 9
dlm: soot: remove member 4
dlm: clvmd: recover b
dlm: clvmd: remove member 4
GFS: fsid=smoke:soot.1: jid=4: Trying to acquire journal lock...
dlm: soot: total members 3 error 0
dlm: soot: dlm_recover_directory
dlm: soot: dlm_recover_directory 5 entries
dlm: clvmd: total members 4 error 0
dlm: clvmd: dlm_recover_directory
dlm: soot: dlm_purge_locks
dlm: soot: dlm_recover_masters
dlm: soot: dlm_recover_masters 1 resources
dlm: soot: dlm_recover_locks
dlm: soot: dlm_recover_locks 0 locks
dlm: clvmd: dlm_recover_directory 3 entries
dlm: soot: dlm_recover_rsbs
dlm: soot: dlm_recover_rsbs 10 rsbs
dlm: clvmd: dlm_purge_locks
dlm: clvmd: dlm_recover_masters
dlm: clvmd: dlm_recover_masters 0 resources
dlm: clvmd: dlm_recover_locks
dlm: clvmd: dlm_recover_locks 0 locks
dlm: soot: recover 9 done: 168 ms
GFS: fsid=smoke:soot.1: jid=4: Looking at journal...
GFS: fsid=smoke:soot.1: jid=4: Done
dlm: clvmd: dlm_recover_rsbs
dlm: clvmd: dlm_recover_rsbs 0 rsbs
dlm: clvmd: recover b done: 232 ms
dlm: soot: recover b
dlm: soot: add member 1
dlm: soot: total members 4 error 0
dlm: soot: dlm_recover_directory
dlm: soot: dlm_recover_directory 3 entries
dlm: soot: recover b done: 64 ms
dlm: clvmd: recover d
dlm: clvmd: add member 4

But this was a recovery test, so I don't know if it's related. The GFS
partition in question seems to be successfully mounted on all the other
nodes except, of course, the one that's hung in vgscan.

Version-Release number of selected component (if applicable):
RHEL5 test 1 with the latest cluster code from 12 Oct 2006, including the
patch for 210359.

How reproducible:
Unknown

Steps to Reproduce:
1. cd /root/sts-test on "smoke"
2. ../gfs/bin/revolver -f var/share/resource_files/smoke.xml -l $PWD -i 0
   -L LITE -I -t 1

Actual results:
vgscan hangs forever. The console has error messages such as:

dlm: Error sending to node 4 -32
dlm: clvmd: dlm_wait_function aborted

(See attached file.)

Expected results:
No vgscan hang
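Additional info:

recover_status is the lockspace's recovery bitmask printed in hex, so
values like the "1" above can be decoded against the DLM_RS_* flags. A
minimal user-space decoder follows; the flag values are mirrored from
fs/dlm/dlm_internal.h of the tree under test, so verify them against the
running kernel - the decoder itself is only an illustration:

/* decode_recover_status.c
 * Decode the hex value from /sys/kernel/dlm/<lockspace>/recover_status.
 * DLM_RS_* values mirrored from fs/dlm/dlm_internal.h (2.6.18-era DLM);
 * check them against the running tree before trusting the names.
 * Build: gcc -o decode_recover_status decode_recover_status.c
 * Use:   ./decode_recover_status 1
 */
#include <stdio.h>
#include <stdlib.h>

static const struct { unsigned int bit; const char *name; } rs_flags[] = {
	{ 0x00000001, "DLM_RS_NODES" },
	{ 0x00000002, "DLM_RS_NODES_ALL" },
	{ 0x00000004, "DLM_RS_DIR" },
	{ 0x00000008, "DLM_RS_DIR_ALL" },
	{ 0x00000010, "DLM_RS_LOCKS" },
	{ 0x00000020, "DLM_RS_LOCKS_ALL" },
	{ 0x00000040, "DLM_RS_DONE" },
	{ 0x00000080, "DLM_RS_DONE_ALL" },
};

int main(int argc, char **argv)
{
	unsigned int status;
	size_t i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <hex status>\n", argv[0]);
		return 1;
	}
	status = strtoul(argv[1], NULL, 16);	/* sysfs prints hex */

	for (i = 0; i < sizeof(rs_flags) / sizeof(rs_flags[0]); i++)
		if (status & rs_flags[i].bit)
			printf("%s\n", rs_flags[i].name);
	return 0;
}

If those values hold, a status of 1 (DLM_RS_NODES alone) would mean kool
completed its own members step but never saw DLM_RS_NODES_ALL from the
rest of the cluster, which would be consistent with dlm_recoverd being
parked in dlm_wait_function.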
The only "odd" message there is the -32 (EPIPE) one which implies that a node has gone down - oh and you forgot to attach the file :) You might like to try this patch, but I don't think it will make much difference t o be honest. logs from the various daemons are, I think, needed here. diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 867f93d..82f2ac0 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -519,6 +519,7 @@ static int receive_from_sock(void) msg.msg_flags = 0; msg.msg_control = incmsg; msg.msg_controllen = sizeof(incmsg); + msg.msg_iovlen = 1; /* I don't see why this circular buffer stuff is necessary for SCTP * which is a packet-based protocol, but the whole thing breaks under
It might also be useful to get the "cman_tool status" and "group_tool"
outputs. I wonder if they look anything like this:

[root@bench-12 cluster]# ./cman/cman_tool/cman_tool nodes
Node  Sts   Inc   Joined               Name
  12   M   8036   2006-10-16 05:43:45  bench-12.lab.msp.redhat.com
  13   X   8040                        bench-13.lab.msp.redhat.com
  14   M   8660   2006-10-16 05:44:25  bench-14.lab.msp.redhat.com
  15   X   8048                        bench-15.lab.msp.redhat.com
  16   X   8060                        bench-16.lab.msp.redhat.com
  17   X   8060                        bench-17.lab.msp.redhat.com
  18   X   8060                        bench-18.lab.msp.redhat.com
  19   X   8044                        bench-19.lab.msp.redhat.com

[root@bench-12 cluster]# ./group/tool/group_tool
type             level name     id       state
fence            0     default  0001000e none
[12 13 14 15 16 17 18]
dlm              1     clvmd    0002000e none
[12 13 14 15 16 17 18]

I can see the DLM attempting to contact node bench-13 (which fails, of
course), with -32 (EPIPE) errors attached to those attempts.
Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.53; previous revision: 1.52
done

That should fix the odd status above - see how you get on with it.
Moving all RHCS version 5 bugs to RHEL 5 so that we can remove the RHCS v5
product, which never actually existed.
I haven't seen this problem for ages; closing CURRENT_RELEASE.