Bug 427531

Summary: Intermittent Kernel Panic when attempting to add or remove a node from the cluster
Product: [Retired] Red Hat Cluster Suite
Reporter: Need Real Name <tom.wills>
Component: dlm-kernel
Assignee: David Teigland <teigland>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 4
CC: cluster-maint
Keywords: ZStream
Hardware: i386
OS: Linux
Fixed In Version: RHBA-2008-0796
Doc Type: Bug Fix
Last Closed: 2008-07-25 19:17:49 UTC
Bug Blocks: 441758

Description Need Real Name 2008-01-04 14:26:32 UTC
Description of problem:
When attempting to add or remove nodes from the cluster, we are intermittently 
receiving the following kernel panic, and the node is being fenced.  
Occasionally the entire cluster will hang.


Version-Release number of selected component (if applicable):
redhat 4.3 

How reproducible:
This problem is intermittent, but will occasionally happen when adding a node 
back into the cluster.

Actual results:

Total_votes: 3
Quorum: 3
Active subsystems: 8
Node name: HLABAPP4-int
Node addresses: 1.1.33.4

[root@HLABAPP4 ~]# tail -f /var/log/start
tail: cannot open `/var/log/start' for reading: No such file or directory
tail: no files remaining
[root@HLABAPP4 ~]#
[root@HLABAPP4 ~]# tail -f /var/log/messages
Dec 18 09:37:21 HLABAPP4 last message repeated 13 times
Dec 18 09:37:22 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:37:26 HLABAPP4 kernel: CMAN: Join request from HLABAPP1-int rejected, config ver
Dec 18 09:37:46 HLABAPP4 last message repeated 9 times
Dec 18 09:37:51 HLABAPP4 kernel: CMAN: node HLABAPP1-int rejoining
Dec 18 09:37:52 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:38:07 HLABAPP4 clurgmgrd[4721]: <info> Magma Event: Membership Change
Dec 18 09:38:07 HLABAPP4 clurgmgrd[4721]: <info> State change: HLABAPP1-int UP
Dec 18 09:38:24 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:38:55 HLABAPP4 clurgmgrd: [4721]: <info> Executing /etc/init.d/hlab_launch_prod
Dec 18 09:39:56 HLABAPP4 last message repeated 2 times
Dec 18 09:40:01 HLABAPP4 crond(pam_unix)[30688]: session opened for user root by (uid=0)
Dec 18 09:40:01 HLABAPP4 crond(pam_unix)[30688]: session closed for user root
Dec 18 09:40:25 HLABAPP4 clurgmgrd[4721]: <info> Magma Event: Membership Change
Dec 18 09:40:25 HLABAPP4 clurgmgrd[4721]: <info> State change: HLABAPP3-int UP
Dec 18 09:40:25 HLABAPP4 kernel: s 0,0,1 ids 13,19,19
Dec 18 09:40:25 HLABAPP4 kernel: clvmd process held requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd processed 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd resend marked requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd resent 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd recover event 19 finished
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 1,0,0 ids 15,15,15
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 0,1,0 ids 15,21,15
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move use event 21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home add node 4
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home total nodes 4
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home rebuild resource directory
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home rebuilt 181 resources
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home purge requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home purged 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home mark waiting requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home marked 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21 done
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 0,0,1 ids 15,21,21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home process held requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home processed 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home resend marked requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home resent 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21 finished
Dec 18 09:40:25 HLABAPP4 kernel: Magma move flags 1,0,0 ids 18,18,18
Dec 18 09:40:25 HLABAPP4 kernel: Magma add_to_requestq cmd 5 fr 2
Dec 18 09:40:25 HLABAPP4 kernel: Magma move flags 0,1,0 ids 18,24,18
Dec 18 09:40:25 HLABAPP4 kernel: Magma move use event 24
Dec 18 09:40:25 HLABAPP4 kernel: Magma recover event 24
Dec 18 09:40:25 HLABAPP4 kernel: Magma add node 4
Dec 18 09:40:25 HLABAPP4 kernel: Magma total nodes 4
Dec 18 09:40:25 HLABAPP4 kernel: Magma rebuild resource directory
Dec 18 09:40:25 HLABAPP4 kernel: Magma rebuilt 0 resources
Dec 18 09:40:25 HLABAPP4 kernel: Magma purge requests
Dec 18 09:40:25 HLABAPP4 kernel:
Dec 18 09:40:25 HLABAPP4 kernel: DLM:  Assertion failed on line 246 of file /usr/src/redhat/BUILD/dlm-kernel-2.6.9-41/smp/src/lockqueue.c
Dec 18 09:40:25 HLABAPP4 kernel: DLM:  assertion:  "lkb"
Dec 18 09:40:25 HLABAPP4 kernel: DLM:  time = 482138424
Dec 18 09:40:25 HLABAPP4 kernel:
Dec 18 09:40:25 HLABAPP4 kernel: ------------[ cut here ]------------
Dec 18 09:40:25 HLABAPP4 kernel: kernel BUG at /usr/src/redhat/BUILD/dlm-kernel-2.6.9-41/smp/src/lockqueue.c:246!
Dec 18 09:40:25 HLABAPP4 kernel: invalid operand: 0000 [#1]
Dec 18 09:40:25 HLABAPP4 kernel: SMP
Dec 18 09:40:25 HLABAPP4 kernel: Modules linked in: hangcheck_timer parport_pc lp parport i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) sg dlm(U) cman(U) md5 ipv6 emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) joydev button battery ac ehci_hcd uhci_hcd hw_random bnx2 bond0(U) qla2400(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2xxx(U) qla2xxx_conf(U) cciss sd_mod scsi_mod
Dec 18 09:40:25 HLABAPP4 kernel: CPU:    6
Dec 18 09:40:25 HLABAPP4 kernel: EIP:    0060:[<f8e09585>]    Tainted: P      VLI
Dec 18 09:40:25 HLABAPP4 kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp)
Dec 18 09:40:25 HLABAPP4 kernel: EIP is at purge_requestqueue+0xc1/0x131 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel: eax: 00000001   ebx: f629db00   ecx: f6e08f6c   edx: f8e15cef
Dec 18 09:40:25 HLABAPP4 kernel: esi: cb95a200   edi: f629db0c   ebp: 00000000   esp: f6e08f68
Dec 18 09:40:25 HLABAPP4 kernel: ds: 007b   es: 007b   ss: 0068
Dec 18 09:40:25 HLABAPP4 kernel: Process dlm_recoverd (pid: 5127, threadinfo=f6e08000 task=f5342830)
Dec 18 09:40:25 HLABAPP4 kernel: Stack: f8e15cef f8e15c98 000000f6 f8e15c5a f8e15c56 1cbcd938 cb95a2bc 00000000
Dec 18 09:40:25 HLABAPP4 kernel:        cb95a200 f68cbc20 f8e13ddc f8e130ca 00000000 00000004 00000002 cb95a200
Dec 18 09:40:25 HLABAPP4 kernel:        f8e13d3d f68cbc20 00000000 f6e08000 00000000 cb95a200 f8e13e1b f6e08000
Dec 18 09:40:25 HLABAPP4 kernel: Call Trace:
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13ddc>] dlm_recoverd+0x0/0x57 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e130ca>] ls_reconfig+0xb5/0x1be [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13d3d>] do_ls_recovery+0x298/0x337 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13e1b>] dlm_recoverd+0x3f/0x57 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<c0133ecd>] kthread+0x73/0x9b
Dec 18 09:40:25 HLABAPP4 kernel:  [<c0133e5a>] kthread+0x0/0x9b
Dec 18 09:40:25 HLABAPP4 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Dec 18 09:40:25 HLABAPP4 kernel: Code: 00 00 a1 a0 28 32 c0 50 68 56 5c e1 f8 68 5a 5c e1 f8 68 f6 00 00 00 68 98 5c e1 f8 e8 bf 90 31 c7 68 ef 5c e1 f8 e8 b5 90 31 c7 <0f> 0b f6 00 5a 5c e1 f8 68 f1 5c e1 f8 e8 70 88 31 c7 83 78 38
Dec 18 09:40:25 HLABAPP4 kernel:  <0>Fatal exception: panic in 5 seconds



Expected results:
Expected the node to come back online without any problems.

Comment 1 David Teigland 2008-01-04 16:14:48 UTC
Some message has gotten out of place; we can just ignore it, there's
no need to panic the machine.

CVSROOT:        /cvs/cluster
Module name:    cluster
Branch:         RHEL4
Changes by:     teigland 2008-01-04 16:12:05

Modified files:
        dlm-kernel/src : lockqueue.c

Log message:
        Some message gets out of place, but there's no need to panic
        the machine; just ignore it.  bz 427531
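
For readers who want to picture the kind of change described above, here is a
minimal user-space sketch. It is not the actual lockqueue.c patch, which is not
shown in this report. The names lkb_stub, request_stub, find_lkb_stub and
purge_requests_stub are hypothetical stand-ins, and the lock table is a toy
array. The sketch only illustrates the general idea: a queued request whose
lock cannot be found is logged and skipped instead of tripping an assertion.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real dlm-kernel structures. */
struct lkb_stub {
    unsigned int id;
};

struct request_stub {
    unsigned int lkid; /* lock id the queued message refers to */
    int nodeid;        /* node the message came from */
};

/* Toy lock table standing in for the lockspace's lock lookup. */
struct lkb_stub lock_table[] = { {1}, {2}, {3}, {4} };

struct lkb_stub *find_lkb_stub(unsigned int id)
{
    size_t n = sizeof(lock_table) / sizeof(lock_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (lock_table[i].id == id)
            return &lock_table[i];
    }
    return NULL;
}

/*
 * Walk the queued requests. A request whose lock cannot be found is
 * treated as an out-of-place message: it is logged and skipped rather
 * than asserted on. Returns the number of requests that had a matching
 * lock; *skipped receives the number that were ignored.
 */
int purge_requests_stub(const struct request_stub *reqs, size_t n, int *skipped)
{
    int handled = 0;

    *skipped = 0;
    for (size_t i = 0; i < n; i++) {
        struct lkb_stub *lkb = find_lkb_stub(reqs[i].lkid);

        if (lkb == NULL) {
            /* An assertion here would take the whole node down. */
            fprintf(stderr, "purge: no lkb for id %u from node %d, ignoring\n",
                    reqs[i].lkid, reqs[i].nodeid);
            (*skipped)++;
            continue;
        }
        handled++;
    }
    return handled;
}
```

Called with three requests of which one names an unknown lock id, the function
reports two handled and one skipped, and the caller carries on with recovery.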


Comment 2 RHEL Program Management 2008-03-17 19:48:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 errata-xmlrpc 2008-07-25 19:17:49 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0796.html