Bug 229715 - cmirror panic in dm_cmirror:cluster_log_serverd
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror
Hardware: All Linux
Priority: medium  Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2007-02-22 16:31 EST by Corey Marthaler
Modified: 2010-04-27 10:59 EDT
Doc Type: Bug Fix
Last Closed: 2010-04-27 10:59:23 EDT
Description Corey Marthaler 2007-02-22 16:31:37 EST
Description of problem:
I was running a looping create/deactivate/delete of a single 3-legged cmirror
and the cluster eventually deadlocked. It appeared that I hit bz 207132 and
217895 before 2 of the 4 nodes in the cluster panicked with the following:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
PML4 215b4c067 PGD 0
Oops: 0000 [1] SMP
Modules linked in:
qla2300 qla2xxx scsi_transport_fc dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U)
lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds
pcmcia_core button battery ac uhci_hcd ehci_hcd hw_random e1000a floppy ata_piix
libata sg
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod
Pid: 18906, comm: cluster_log_ser Not tainted 2.6.9-48.ELsmp
RIP: 0010:[<ffffffffa02f3a86>]
RSP: 0018:00000100ca121d88  EFLAGS: 00010246
RAX: fffffffffffffe48 RBX: ffffffffa02f9208 RCX: 0000000000000246
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 0000010037d9d800 R08: 00000000000927bf R09: 00000100dfd68a00
R10: ffffffff80318d00 R11: 0000ffff80400460 R12: fffffffffffffe48
R13: 0000000000000001 R14: 00000100ca121dd8 R15: 00000101e56e0680
FS:  0000000000000000(0000) GS:ffffffff804ed580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000dff88000 CR4: 00000000000006e0
Process cluster_log_ser (pid: 18906, threadinfo 00000100ca120000, task
Stack: 0000000000000000 0000000000000000 000000000000000 0000000000000000
       0000000000000000 00000101e56e0680 000000000000000 0000c73f00000004
       0000000000000002 0000000000000000

Call Trace:  <ffffffff80134660>{default_wake_function+0}

Code: 8b 84 24 b8 1 01 00 00 0f 18 08 48 81 4c 93 2f a0 0f 84
RIP <ffffffffa02f3a86>{:dm_cmirror:cluster_log_serverd+829} RSP <00000100ca121d88>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
Comment 1 Jonathan Earl Brassow 2007-02-23 12:26:06 EST
How reproducible is this?  Minutes?  Hours?  Haven't seen it again?
Comment 2 Jonathan Earl Brassow 2007-02-23 13:00:14 EST
Also, were there any other cmirrors present or just the one you were
Comment 3 Corey Marthaler 2007-02-23 14:59:48 EST
It took a few hours of running the cmd loop before the issues took place, and I
did have another mirror sitting active but idle in the cluster.
Comment 4 Jonathan Earl Brassow 2007-02-23 15:42:39 EST
Guessing from the assembly code, I would say that the cluster_log_serverd thread
is dereferencing a log context that was *just* freed by the client.  There is no
protection for the log list, so this seems possible.

Simple locking would fix this problem.  However, I'd like to make sure that this
is what is happening...
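
One arithmetic check is consistent with this guess. In the oops, RAX and R12 both hold 0xfffffffffffffe48, which is -0x1b8 as a signed 64-bit value, and the faulting address (CR2) is 0. That is the pattern a container_of-style computation would produce from a NULL list pointer. The sketch below assumes the list member sits at offset 0x1b8 inside the log context; that offset and the helper names are assumptions for illustration, since the structure layout is not shown in this report.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: offset of the list member inside the log context struct.
 * Chosen because -0x1b8 matches the RAX/R12 value in the oops. */
#define LIST_MEMBER_OFFSET 0x1b8u

/* Mimics container_of(): step back from the embedded list member
 * to the start of the enclosing structure (hypothetical helper). */
static uint64_t entry_from_member(uint64_t member_addr)
{
    return member_addr - LIST_MEMBER_OFFSET;
}

/* Address that is touched when the entry's list member is read
 * (hypothetical helper). */
static uint64_t member_access_addr(uint64_t entry_addr)
{
    return entry_addr + LIST_MEMBER_OFFSET;
}
```

Under that assumption, a NULL member pointer maps to an entry address of 0xfffffffffffffe48, and reading the member of that entry touches address 0, which matches the register dump and CR2. This supports the stale or cleared list pointer theory but does not prove it.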
Comment 5 Jonathan Earl Brassow 2007-02-26 12:38:20 EST
Add locking around the log list.  There was a small window of opportunity
for the log server to look up a log in the list while another entry was
being deleted (bad for the server).
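
The kind of locking described above can be sketched in user space as follows. This is a minimal illustration, not the actual dm_cmirror patch: a pthread mutex stands in for whatever kernel lock the fix uses, and the names (struct log_ctx, log_list_lock, log_add, log_remove, log_region_count) are hypothetical. The point is that lookup and removal both hold the same lock, so the server can never walk the list while an entry is being unlinked and freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical log context; the real structure is not shown in this report. */
struct log_ctx {
    char uuid[32];
    int region_count;
    struct log_ctx *next;
};

static struct log_ctx *log_list;                       /* shared list head */
static pthread_mutex_t log_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Client side: register a new log. Returns 0 on success, -1 on failure. */
static int log_add(const char *uuid, int region_count)
{
    struct log_ctx *lc = calloc(1, sizeof(*lc));
    if (!lc)
        return -1;
    strncpy(lc->uuid, uuid, sizeof(lc->uuid) - 1);
    lc->region_count = region_count;

    pthread_mutex_lock(&log_list_lock);
    lc->next = log_list;
    log_list = lc;
    pthread_mutex_unlock(&log_list_lock);
    return 0;
}

/* Client side: unlink under the lock, free only after it is off the list. */
static int log_remove(const char *uuid)
{
    struct log_ctx **pp, *victim = NULL;

    pthread_mutex_lock(&log_list_lock);
    for (pp = &log_list; *pp; pp = &(*pp)->next) {
        if (strcmp((*pp)->uuid, uuid) == 0) {
            victim = *pp;
            *pp = victim->next;
            break;
        }
    }
    pthread_mutex_unlock(&log_list_lock);

    if (!victim)
        return -1;
    free(victim);
    return 0;
}

/* Server side: look up and read the entry while holding the lock, and
 * copy out what is needed so no pointer escapes the critical section. */
static int log_region_count(const char *uuid, int *out)
{
    int found = -1;

    pthread_mutex_lock(&log_list_lock);
    for (struct log_ctx *lc = log_list; lc; lc = lc->next) {
        if (strcmp(lc->uuid, uuid) == 0) {
            *out = lc->region_count;
            found = 0;
            break;
        }
    }
    pthread_mutex_unlock(&log_list_lock);
    return found;
}
```

Copying the needed field out under the lock, rather than returning a pointer to the entry, is one simple way to avoid using a context after the client has freed it; a reference count would be an alternative if the server needs to hold the entry longer.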
Comment 7 Corey Marthaler 2007-08-24 14:38:28 EDT
Marking this fix verified as I haven't been able to reproduce since the fix went in.
Comment 9 Alasdair Kergon 2010-04-27 10:59:23 EDT
Assuming this VERIFIED fix got released.  Closing.
Reopen if it's not yet resolved.
