Description of problem:
When attempting to add or remove nodes from the cluster, we intermittently receive the following kernel panic, and the node is fenced. Occasionally the entire cluster will hang.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 4.3

How reproducible:
This problem is intermittent, but will occasionally happen when adding a node back into the cluster.

Actual results:
Total_votes: 3
Quorum: 3
Active subsystems: 8
Node name: HLABAPP4-int
Node addresses: 1.1.33.4

[root@HLABAPP4 ~]# tail -f /var/log/start
tail: cannot open `/var/log/start' for reading: No such file or directory
tail: no files remaining
[root@HLABAPP4 ~]#
[root@HLABAPP4 ~]# tail -f /var/log/messages
Dec 18 09:37:21 HLABAPP4 last message repeated 13 times
Dec 18 09:37:22 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:37:26 HLABAPP4 kernel: CMAN: Join request from HLABAPP1-int rejected, config ver
Dec 18 09:37:46 HLABAPP4 last message repeated 9 times
Dec 18 09:37:51 HLABAPP4 kernel: CMAN: node HLABAPP1-int rejoining
Dec 18 09:37:52 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:38:07 HLABAPP4 clurgmgrd[4721]: <info> Magma Event: Membership Change
Dec 18 09:38:07 HLABAPP4 clurgmgrd[4721]: <info> State change: HLABAPP1-int UP
Dec 18 09:38:24 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:38:55 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:39:56 HLABAPP4 last message repeated 2 times
Dec 18 09:40:01 HLABAPP4 crond(pam_unix)[30688]: session opened for user root by (uid=0)
Dec 18 09:40:01 HLABAPP4 crond(pam_unix)[30688]: session closed for user root
Dec 18 09:40:25 HLABAPP4 clurgmgrd[4721]: <info> Magma Event: Membership Change
Dec 18 09:40:25 HLABAPP4 clurgmgrd[4721]: <info> State change: HLABAPP3-int UP
Dec 18 09:40:25 HLABAPP4 kernel: s 0,0,1 ids 13,19,19
Dec 18 09:40:25 HLABAPP4 kernel: clvmd process held requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd processed 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd resend marked requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd resent 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd recover event 19 finished
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 1,0,0 ids 15,15,15
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 0,1,0 ids 15,21,15
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move use event 21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home add node 4
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home total nodes 4
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home rebuild resource directory
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home rebuilt 181 resources
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home purge requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home purged 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home mark waiting requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home marked 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21 done
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 0,0,1 ids 15,21,21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home process held requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home processed 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home resend marked requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home resent 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21 finished
Dec 18 09:40:25 HLABAPP4 kernel: Magma move flags 1,0,0 ids 18,18,18
Dec 18 09:40:25 HLABAPP4 kernel: Magma add_to_requestq cmd 5 fr 2
Dec 18 09:40:25 HLABAPP4 kernel: Magma move flags 0,1,0 ids 18,24,18
Dec 18 09:40:25 HLABAPP4 kernel: Magma move use event 24
Dec 18 09:40:25 HLABAPP4 kernel: Magma recover event 24
Dec 18 09:40:25 HLABAPP4 kernel: Magma add node 4
Dec 18 09:40:25 HLABAPP4 kernel: Magma total nodes 4
Dec 18 09:40:25 HLABAPP4 kernel: Magma rebuild resource directory
Dec 18 09:40:25 HLABAPP4 kernel: Magma rebuilt 0 resources
Dec 18 09:40:25 HLABAPP4 kernel: Magma purge requests
Dec 18 09:40:25 HLABAPP4 kernel:
Dec 18 09:40:25 HLABAPP4 kernel: DLM: Assertion failed on line 246 of file /usr/src/redhat/BUILD/dlm-kernel-2.6.9-41/smp/src/lockqueue.c
Dec 18 09:40:25 HLABAPP4 kernel: DLM: assertion: "lkb"
Dec 18 09:40:25 HLABAPP4 kernel: DLM: time = 482138424
Dec 18 09:40:25 HLABAPP4 kernel:
Dec 18 09:40:25 HLABAPP4 kernel: ------------[ cut here ]------------
Dec 18 09:40:25 HLABAPP4 kernel: kernel BUG at /usr/src/redhat/BUILD/dlm-kernel-2.6.9-41/smp/src/lockqueue.c:246!
Dec 18 09:40:25 HLABAPP4 kernel: invalid operand: 0000 [#1]
Dec 18 09:40:25 HLABAPP4 kernel: SMP
Dec 18 09:40:25 HLABAPP4 kernel: Modules linked in: hangcheck_timer parport_pc lp parport i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) sg dlm(U) cman(U) md5 ipv6 emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) joydev button battery ac ehci_hcd uhci_hcd hw_random bnx2 bond0(U) qla2400(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2xxx(U) qla2xxx_conf(U) cciss sd_mod scsi_mod
Dec 18 09:40:25 HLABAPP4 kernel: CPU: 6
Dec 18 09:40:25 HLABAPP4 kernel: EIP: 0060:[<f8e09585>] Tainted: P VLI
Dec 18 09:40:25 HLABAPP4 kernel: EFLAGS: 00010246 (2.6.9-34.ELsmp)
Dec 18 09:40:25 HLABAPP4 kernel: EIP is at purge_requestqueue+0xc1/0x131 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel: eax: 00000001 ebx: f629db00 ecx: f6e08f6c edx: f8e15cef
Dec 18 09:40:25 HLABAPP4 kernel: esi: cb95a200 edi: f629db0c ebp: 00000000 esp: f6e08f68
Dec 18 09:40:25 HLABAPP4 kernel: ds: 007b es: 007b ss: 0068
Dec 18 09:40:25 HLABAPP4 kernel: Process dlm_recoverd (pid: 5127, threadinfo=f6e08000 task=f5342830)
Dec 18 09:40:25 HLABAPP4 kernel: Stack: f8e15cef f8e15c98 000000f6 f8e15c5a f8e15c56 1cbcd938 cb95a2bc 00000000
Dec 18 09:40:25 HLABAPP4 kernel:        cb95a200 f68cbc20 f8e13ddc f8e130ca 00000000 00000004 00000002 cb95a200
Dec 18 09:40:25 HLABAPP4 kernel:        f8e13d3d f68cbc20 00000000 f6e08000 00000000 cb95a200 f8e13e1b f6e08000
Dec 18 09:40:25 HLABAPP4 kernel: Call Trace:
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13ddc>] dlm_recoverd+0x0/0x57 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e130ca>] ls_reconfig+0xb5/0x1be [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13d3d>] do_ls_recovery+0x298/0x337 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13e1b>] dlm_recoverd+0x3f/0x57 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<c0133ecd>] kthread+0x73/0x9b
Dec 18 09:40:25 HLABAPP4 kernel:  [<c0133e5a>] kthread+0x0/0x9b
Dec 18 09:40:25 HLABAPP4 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Dec 18 09:40:25 HLABAPP4 kernel: Code: 00 00 a1 a0 28 32 c0 50 68 56 5c e1 f8 68 5a 5c e1 f8 68 f6 00 00 00 68 98 5c e1 f8 e8 bf 90 31 c7 68 ef 5c e1 f8 e8 b5 90 31 c7 <0f> 0b f6 00 5a 5c e1 f8 68 f1 5c e1 f8 e8 70 88 31 c7 83 78 38
Dec 18 09:40:25 HLABAPP4 kernel: <0>Fatal exception: panic in 5 seconds

Expected results:
Expected the node to come back online without any problems.
Some message has gotten out of place; we can just ignore it, there's no need to panic the machine.

CVSROOT:        /cvs/cluster
Module name:    cluster
Branch:         RHEL4
Changes by:     teigland 2008-01-04 16:12:05

Modified files:
        dlm-kernel/src : lockqueue.c

Log message:
        Some message gets out of place, but there's no need to panic the
        machine; just ignore it.
        bz 427531
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0796.html