Bug 144144
| Summary: | clvmd registering with dlm causes "kernel BUG at fs/inode.c:1100!" | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Derek Anderson <danderso> |
| Component: | dlm | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | cluster-maint |
| Hardware: | All | | |
| OS: | Linux | | |
| Doc Type: | Bug Fix | | |
| Last Closed: | 2009-04-24 14:41:34 UTC | | |
I hit this as well: same RPMs and kernel, different cluster.
I did:
modprobe cman
modprobe dlm
ccsd
cman_tool join
fence_tool join
clvmd
killall clvmd
fence_tool leave
cman_tool leave
killall ccsd
ccsd
cman_tool join
fence_tool join
clvmd
------------[ cut here ]------------
kernel BUG at fs/inode.c:1100!
invalid operand: 0000 [#1]
SMP
Modules linked in: dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode dm_mod uhci_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c016b943>] Tainted: GF VLI
EFLAGS: 00010246 (2.6.9-1.906_ELsmp)
EIP is at iput+0x19/0x61
eax: c033ac60 ebx: f5b639ac ecx: 00000000 edx: 00000000
esi: f5a176f0 edi: 00000000 ebp: f5969218 esp: f5c89ee4
ds: 007b es: 007b ss: 0068
Process clvmd (pid: 2220, threadinfo=f5c89000 task=f5b77970)
Stack: f5a176f8 f8a71f8c 00000000 ffffff9e f5969380 f8a734da 00000000 f5969280
00000005 f8a7101f f8a85c50 f8a7105e f8a716de f5969280 00000000 f596921e
ffffffff f8a6b20a 00000000 0000000a 00000000 f5c89f54 f5969218 f5969210
Call Trace:
[<f8a71f8c>] close_connection+0x3e/0x8e [dlm]
[<f8a734da>] lowcomms_start+0x191/0x1f6 [dlm]
[<f8a7101f>] threads_start+0x20/0x3e [dlm]
[<f8a7105e>] init_internal+0x17/0x30 [dlm]
[<f8a716de>] dlm_new_lockspace+0x39/0x61 [dlm]
[<f8a6b20a>] register_lockspace+0xa3/0x14a [dlm]
[<f8a6b9d6>] do_user_create_lockspace+0x21/0x32 [dlm]
[<f8a6c527>] dlm_write+0x156/0x1ae [dlm]
[<c0155650>] vfs_write+0xb6/0xe2
[<c015571a>] sys_write+0x3c/0x62
[<c02c6287>] syscall_call+0x7/0xb
Code: 5e c3 83 78 24 00 75 05 e9 00 fe ff ff e9 fa fe ff ff 53 85 c0 89 c3 74 58 83 bb 3c 01 00 00 20 8b 80 a4 00 00 00 8b 40 24 75 08 <0f> 0b 4c 04 37 d4 2d c0 85 c0 74 0b 8b 50 14 85 d2 74 04 89 d8
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
The "can't bind" error has been around since day one, but the panic is new. This should get rid of that oops: con->sock was not being NULLed when the bind() failed.
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v <-- lowcomms.c
new revision: 1.24; previous revision: 1.23
done
Description of problem:
This happened at cluster startup after the cluster had been completely shut down. On all nodes:
- ccsd
- cman_tool join
- fence_tool join
- clvmd
At this point 2 of the 3 nodes panicked with:
dlm: Can't bind to port 21064
------------[ cut here ]------------
kernel BUG at fs/inode.c:1100!
invalid operand: 0000 [#1]
SMP
Modules linked in: dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode uhci_hcd ehci_hcd button battery ac ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c016b943>] Tainted: GF VLI
EFLAGS: 00010246 (2.6.9-1.906_ELsmp)
EIP is at iput+0x19/0x61
eax: c033ac60 ebx: d980cdac ecx: 00000000 edx: 00000000
esi: d997c6f0 edi: 00000000 ebp: dca72718 esp: dbf05ee4
ds: 007b es: 007b ss: 0068
Process clvmd (pid: 2520, threadinfo=dbf05000 task=d9a07870)
Stack: d997c6f8 e0317f8c 00000000 ffffff9e dca80f80 e03194da 00000000 dca80f80
00000005 e031701f e032bc50 e031705e e03176de dca80f80 00000000 dca7271e
ffffffff e031120a 00000000 0000000a 00000000 dbf05f54 dca72718 dca72710
Call Trace:
[<e0317f8c>] close_connection+0x3e/0x8e [dlm]
[<e03194da>] lowcomms_start+0x191/0x1f6 [dlm]
[<e031701f>] threads_start+0x20/0x3e [dlm]
[<e031705e>] init_internal+0x17/0x30 [dlm]
[<e03176de>] dlm_new_lockspace+0x39/0x61 [dlm]
[<e031120a>] register_lockspace+0xa3/0x14a [dlm]
[<e03119d6>] do_user_create_lockspace+0x21/0x32 [dlm]
[<e0312527>] dlm_write+0x156/0x1ae [dlm]
[<c0155650>] vfs_write+0xb6/0xe2
[<c015571a>] sys_write+0x3c/0x62
[<c02c6287>] syscall_call+0x7/0xb
Code: 5e c3 83 78 24 00 75 05 e9 00 fe ff ff e9 fa fe ff ff 53 85 c0 89 c3 74 58 83 bb 3c 01 00 00 20 8b 80 a4 00 00 00 8b 40 24 75 08 <0f> 0b 4c 04 37 d4 2d c0 85 c0 74 0b 8b 50 14 85 d2 74 04 89 d8
<0>Fatal exception: panic in 5 seconds
Version-Release number of selected component (if applicable):
6.1 RPMS built on Wed 15 Dec 2004 01:13:08 PM CST
How reproducible:
Haven't seen it before, though the "Can't bind to port 21064" message also showed up in bugzilla #129458.