Bug 427531 - Intermittent Kernel Panic when attempting to add or remove a node from the cluster
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm-kernel
Version: 4
Hardware: i386 Linux
Priority: urgent
Severity: urgent
Assigned To: David Teigland
QA Contact: Cluster QE
Keywords: ZStream
Depends On:
Blocks: 441758
Reported: 2008-01-04 09:26 EST by Need Real Name
Modified: 2009-04-16 15:49 EDT
CC List: 1 user

Fixed In Version: RHBA-2008-0796
Doc Type: Bug Fix
Last Closed: 2008-07-25 15:17:49 EDT




External Trackers:
Red Hat Product Errata RHBA-2008:0796 | normal | SHIPPED_LIVE | dlm-kernel bug fix update | 2008-07-25 15:17:43 EDT

Description Need Real Name 2008-01-04 09:26:32 EST
Description of problem: When attempting to add or remove nodes from the cluster,
we are intermittently receiving the following kernel panic, and the node is
being fenced.  Occasionally the entire cluster will hang.


Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 4.3

How reproducible:
This problem is intermittent, but will occasionally happen when adding a node
back into the cluster.
Actual results:

Total_votes: 3
Quorum: 3
Active subsystems: 8
Node name: HLABAPP4-int
Node addresses: 1.1.33.4

[root@HLABAPP4 ~]# tail -f /var/log/start
tail: cannot open `/var/log/start' for reading: No such file or directory
tail: no files remaining
[root@HLABAPP4 ~]#
[root@HLABAPP4 ~]# tail -f /var/log/messages
Dec 18 09:37:21 HLABAPP4 last message repeated 13 times
Dec 18 09:37:22 HLABAPP4 clurgmgrd: [4721]: <info> 
Executing /etc/init.d/hlab_launch_prod
Dec 18 09:37:26 HLABAPP4 kernel: CMAN: Join request from HLABAPP1-int rejected, 
config ver
Dec 18 09:37:46 HLABAPP4 last message repeated 9 times
Dec 18 09:37:51 HLABAPP4 kernel: CMAN: node HLABAPP1-int rejoining
Dec 18 09:37:52 HLABAPP4 clurgmgrd: [4721]: <info> 
Executing /etc/init.d/hlab_launch_prod
Dec 18 09:38:07 HLABAPP4 clurgmgrd[4721]: <info> Magma Event: Membership Change
Dec 18 09:38:07 HLABAPP4 clurgmgrd[4721]: <info> State change: HLABAPP1-int UP
Dec 18 09:38:24 HLABAPP4 clurgmgrd: [4721]: <info> 
Executing /etc/init.d/hlab_launch_prod
Dec 18 09:38:55 HLABAPP4 clurgmgrd: [4721]: <info> 
Executing /etc/init.d/hlab_launch_prod
Dec 18 09:39:56 HLABAPP4 last message repeated 2 times
Dec 18 09:40:01 HLABAPP4 crond(pam_unix)[30688]: session opened for user root 
by (uid=0)
Dec 18 09:40:01 HLABAPP4 crond(pam_unix)[30688]: session closed for user root
Dec 18 09:40:25 HLABAPP4 clurgmgrd[4721]: <info> Magma Event: Membership Change
Dec 18 09:40:25 HLABAPP4 clurgmgrd[4721]: <info> State change: HLABAPP3-int UP
Dec 18 09:40:25 HLABAPP4 kernel: s 0,0,1 ids 13,19,19
Dec 18 09:40:25 HLABAPP4 kernel: clvmd process held requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd processed 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd resend marked requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd resent 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: clvmd recover event 19 finished
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 1,0,0 ids 15,15,15
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 0,1,0 ids 15,21,15
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move use event 21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home add node 4
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home total nodes 4
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home rebuild resource directory
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home rebuilt 181 resources
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home purge requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home purged 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home mark waiting requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home marked 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21 done
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home move flags 0,0,1 ids 15,21,21
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home process held requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home processed 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home resend marked requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home resent 0 requests
Dec 18 09:40:25 HLABAPP4 kernel: /svr_home recover event 21 finished
Dec 18 09:40:25 HLABAPP4 kernel: Magma move flags 1,0,0 ids 18,18,18
Dec 18 09:40:25 HLABAPP4 kernel: Magma add_to_requestq cmd 5 fr 2
Dec 18 09:40:25 HLABAPP4 kernel: Magma move flags 0,1,0 ids 18,24,18
Dec 18 09:40:25 HLABAPP4 kernel: Magma move use event 24
Dec 18 09:40:25 HLABAPP4 kernel: Magma recover event 24
Dec 18 09:40:25 HLABAPP4 kernel: Magma add node 4
Dec 18 09:40:25 HLABAPP4 kernel: Magma total nodes 4
Dec 18 09:40:25 HLABAPP4 kernel: Magma rebuild resource directory
Dec 18 09:40:25 HLABAPP4 kernel: Magma rebuilt 0 resources
Dec 18 09:40:25 HLABAPP4 kernel: Magma purge requests
Dec 18 09:40:25 HLABAPP4 kernel:
Dec 18 09:40:25 HLABAPP4 kernel: DLM:  Assertion failed on line 246 of 
file /usr/src/redhat/BUILD/dlm-kernel-2.6.9-41/smp/src/lockqueue.c
Dec 18 09:40:25 HLABAPP4 kernel: DLM:  assertion:  "lkb"
Dec 18 09:40:25 HLABAPP4 kernel: DLM:  time = 482138424
Dec 18 09:40:25 HLABAPP4 kernel:
Dec 18 09:40:25 HLABAPP4 kernel: ------------[ cut here ]------------
Dec 18 09:40:25 HLABAPP4 kernel: kernel BUG at /usr/src/redhat/BUILD/dlm-kernel-
2.6.9-41/smp/src/lockqueue.c:246!
Dec 18 09:40:25 HLABAPP4 kernel: invalid operand: 0000 [#1]
Dec 18 09:40:25 HLABAPP4 kernel: SMP
Dec 18 09:40:25 HLABAPP4 kernel: Modules linked in: hangcheck_timer parport_pc 
lp parport i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) sg dlm(U) cman
(U) md5 ipv6 emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) 
emcplib(U) joydev button battery ac ehci_hcd uhci_hcd hw_random bnx2 bond0(U) 
qla2400(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2xxx(U) qla2xxx_conf
(U) cciss sd_mod scsi_mod
Dec 18 09:40:25 HLABAPP4 kernel: CPU:    6
Dec 18 09:40:25 HLABAPP4 kernel: EIP:    0060:[<f8e09585>]    Tainted: P      
VLI
Dec 18 09:40:25 HLABAPP4 kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp)
Dec 18 09:40:25 HLABAPP4 kernel: EIP is at purge_requestqueue+0xc1/0x131 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel: eax: 00000001   ebx: f629db00   ecx: 
f6e08f6c   edx: f8e15cef
Dec 18 09:40:25 HLABAPP4 kernel: esi: cb95a200   edi: f629db0c   ebp: 
00000000   esp: f6e08f68
Dec 18 09:40:25 HLABAPP4 kernel: ds: 007b   es: 007b   ss: 0068
Dec 18 09:40:25 HLABAPP4 kernel: Process dlm_recoverd (pid: 5127, 
threadinfo=f6e08000 task=f5342830)
Dec 18 09:40:25 HLABAPP4 kernel: Stack: f8e15cef f8e15c98 000000f6 f8e15c5a 
f8e15c56 1cbcd938 cb95a2bc 00000000
Dec 18 09:40:25 HLABAPP4 kernel:        cb95a200 f68cbc20 f8e13ddc f8e130ca 
00000000 00000004 00000002 cb95a200
Dec 18 09:40:25 HLABAPP4 kernel:        f8e13d3d f68cbc20 00000000 f6e08000 
00000000 cb95a200 f8e13e1b f6e08000
Dec 18 09:40:25 HLABAPP4 kernel: Call Trace:
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13ddc>] dlm_recoverd+0x0/0x57 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e130ca>] ls_reconfig+0xb5/0x1be [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13d3d>] do_ls_recovery+0x298/0x337 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<f8e13e1b>] dlm_recoverd+0x3f/0x57 [dlm]
Dec 18 09:40:25 HLABAPP4 kernel:  [<c0133ecd>] kthread+0x73/0x9b
Dec 18 09:40:25 HLABAPP4 kernel:  [<c0133e5a>] kthread+0x0/0x9b
Dec 18 09:40:25 HLABAPP4 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Dec 18 09:40:25 HLABAPP4 kernel: Code: 00 00 a1 a0 28 32 c0 50 68 56 5c e1 f8 
68 5a 5c e1 f8 68 f6 00 00 00 68 98 5c e1 f8 e8 bf 90 31 c7 68 ef 5c e1 f8 e8 
b5 90 31 c7 <0f> 0b f6 00 5a 5c e1 f8 68 f1 5c e1 f8 e8 70 88 31 c7 83 78 38
Dec 18 09:40:25 HLABAPP4 kernel:  <0>Fatal exception: panic in 5 seconds



Expected results:
Expected the node to come back online without any problems.
Comment 1 David Teigland 2008-01-04 11:14:48 EST
Some message has gotten out of place; we can just ignore it, there's
no need to panic the machine.

CVSROOT:        /cvs/cluster
Module name:    cluster
Branch:         RHEL4
Changes by:     teigland@sourceware.org 2008-01-04 16:12:05

Modified files:
        dlm-kernel/src : lockqueue.c

Log message:
        Some message gets out of place, but there's no need to panic
        the machine; just ignore it.  bz 427531
Comment 2 RHEL Product and Program Management 2008-03-17 15:48:04 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 8 errata-xmlrpc 2008-07-25 15:17:49 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0796.html
