Description of problem:
I was running a looping create/deactivate/delete of a single 3-legged cmirror and the cluster eventually deadlocked. It appeared that I hit bz 207132 and bz 217895 before 2 of the 4 nodes in the cluster panicked with the following:

Unable to handle kernel NULL pointer dereference at 0000000000000000
 RIP: <ffffffffa02f3a86>{:dm_cmirror:cluster_log_serverd+829}
PML4 215b4c067 PGD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: qla2300 qla2xxx scsi_transport_fc dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core button battery ac uhci_hcd ehci_hcd hw_random e1000a floppy ata_piix libata sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 18906, comm: cluster_log_ser Not tainted 2.6.9-48.ELsmp
RIP: 0010:[<ffffffffa02f3a86>] <ffffffffa02f3a86>{:dm_cmirror:cluster_log_serverd+829}
RSP: 0018:00000100ca121d88  EFLAGS: 00010246
RAX: fffffffffffffe48 RBX: ffffffffa02f9208 RCX: 0000000000000246
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 0000010037d9d800 R08: 00000000000927bf R09: 00000100dfd68a00
R10: ffffffff80318d00 R11: 0000ffff80400460 R12: fffffffffffffe48
R13: 0000000000000001 R14: 00000100ca121dd8 R15: 00000101e56e0680
FS:  0000000000000000(0000) GS:ffffffff804ed580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000dff88000 CR4: 00000000000006e0
Process cluster_log_ser (pid: 18906, threadinfo 00000100ca120000, task 0000010214e557f0)
Stack: 0000000000000000 0000000000000000 000000000000000 0000000000000000
       0000000000000000 00000101e56e0680 000000000000000 0000c73f00000004
       0000000000000002 0000000000000000
Call Trace:
 <ffffffff80134660>{default_wake_function+0} <ffffffff80110f47>{child_rip+8}
 <ffffffffa02f3749>{:dm_cmirror:cluster_log_serverd+0} <ffffffff80110f3f>{child_rip+0}
Code: 8b 84 24 b8 1 01 00 00 0f 18 08 48 81 4c 93 2f a0 0f 84
RIP <ffffffffa02f3a86>{:dm_cmirror:cluster_log_serverd+829} RSP <00000100ca121d88>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
2.6.9-48.ELsmp
lvm2-2.02.21-3.el4
lvm2-cluster-2.02.21-3.el4
cmirror-kernel-smp-2.6.9-20.0
cmirror-1.0.1-1
How reproducible is this? Minutes? Hours? Haven't seen it again?
Also, were there any other cmirrors present, or just the one you were creating/deactivating/deleting?
It took a few hours of running the command loop before the issue occurred, and I did have another mirror active but idle in the cluster.
Guessing from the assembly code, I would say that the cluster_log_serverd thread is dereferencing a log context that was *just* freed by the client. There is no protection for the log list, so this seems possible. Simple locking would fix this problem. However, I'd like to make sure that this is what is happening...
Add locking around the log list. There was a small window of opportunity for the log server to look up a log in the list while another entry was being deleted (bad for the server).
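The shape of the fix can be sketched in userspace C. This is a minimal illustration, not the actual dm-cmirror patch: the struct name, field names, and helper functions below are hypothetical, and a pthread mutex stands in for the kernel spinlock that would protect the list. The point is that both the server-side lookup and the client-side delete take the same lock, closing the window where the server could walk the list while an entry was being freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cmirror log context kept on a
 * singly linked list.  In the kernel this would be a list_head
 * protected by a spinlock; here a mutex plays that role. */
struct log_ctx {
	char uuid[64];
	struct log_ctx *next;
};

static struct log_ctx *log_list;  /* head of the log list */
static pthread_mutex_t log_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Client side: add a log context to the list under the lock. */
void add_log(const char *uuid)
{
	struct log_ctx *lc = calloc(1, sizeof(*lc));
	strncpy(lc->uuid, uuid, sizeof(lc->uuid) - 1);
	pthread_mutex_lock(&log_list_lock);
	lc->next = log_list;
	log_list = lc;
	pthread_mutex_unlock(&log_list_lock);
}

/* Server side: look up a log by uuid while holding the lock, so a
 * concurrent delete cannot unlink/free the entry mid-walk.  (The
 * unprotected version of this walk is what the panicking
 * cluster_log_serverd thread was doing.) */
struct log_ctx *get_log(const char *uuid)
{
	struct log_ctx *lc;
	pthread_mutex_lock(&log_list_lock);
	for (lc = log_list; lc; lc = lc->next)
		if (!strcmp(lc->uuid, uuid))
			break;
	pthread_mutex_unlock(&log_list_lock);
	return lc;
}

/* Client side: unlink and free an entry under the same lock. */
void remove_log(const char *uuid)
{
	struct log_ctx **pp, *lc;
	pthread_mutex_lock(&log_list_lock);
	for (pp = &log_list; (lc = *pp); pp = &lc->next) {
		if (!strcmp(lc->uuid, uuid)) {
			*pp = lc->next;
			free(lc);
			break;
		}
	}
	pthread_mutex_unlock(&log_list_lock);
}
```

Note that serializing the list walk against delete is only the first half of the story: if the server keeps using the returned pointer after dropping the lock, a reference count or holding the lock across the use would also be needed. The sketch only shows the list-protection part described in the fix.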
Marking this fix verified as I haven't been able to reproduce since the fix went in.
Assuming this VERIFIED fix got released. Closing. Reopen if it's not yet resolved.