Description of problem:
I was running a looping create/deactivate/delete of a single 3-legged cmirror and the cluster eventually deadlocked. It appeared that I hit bz 207132 and bz 217895 before 2 of the 4 nodes in the cluster panicked with the following:

Unable to handle kernel NULL pointer dereference at 0000000000000000
 RIP: <ffffffffa02f3a86>{:dm_cmirror:cluster_log_serverd+829}
PML4 215b4c067 PGD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: qla2300 qla2xxx scsi_transport_fc dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core button battery ac uhci_hcd ehci_hcd hw_random e1000a floppy ata_piix libata sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 18906, comm: cluster_log_ser Not tainted 2.6.9-48.ELsmp
RIP: 0010:[<ffffffffa02f3a86>] <ffffffffa02f3a86>{:dm_cmirror:cluster_log_serverd+829}
RSP: 0018:00000100ca121d88  EFLAGS: 00010246
RAX: fffffffffffffe48 RBX: ffffffffa02f9208 RCX: 0000000000000246
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 0000010037d9d800 R08: 00000000000927bf R09: 00000100dfd68a00
R10: ffffffff80318d00 R11: 0000ffff80400460 R12: fffffffffffffe48
R13: 0000000000000001 R14: 00000100ca121dd8 R15: 00000101e56e0680
FS:  0000000000000000(0000) GS:ffffffff804ed580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000dff88000 CR4: 00000000000006e0
Process cluster_log_ser (pid: 18906, threadinfo 00000100ca120000, task 0000010214e557f0)
Stack: 0000000000000000 0000000000000000 000000000000000 0000000000000000
       0000000000000000 00000101e56e0680 000000000000000 0000c73f00000004
       0000000000000002 0000000000000000
Call Trace:
 <ffffffff80134660>{default_wake_function+0} <ffffffff80110f47>{child_rip+8}
 <ffffffffa02f3749>{:dm_cmirror:cluster_log_serverd+0} <ffffffff80110f3f>{child_rip+0}
Code: 8b 84 24 b8 1 01 00 00 0f 18 08 48 81 4c 93 2f a0 0f 84
RIP <ffffffffa02f3a86>{:dm_cmirror:cluster_log_serverd+829} RSP <00000100ca121d88>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
2.6.9-48.ELsmp
lvm2-2.02.21-3.el4
lvm2-cluster-2.02.21-3.el4
cmirror-kernel-smp-2.6.9-20.0
cmirror-1.0.1-1
How reproducible is this? Minutes? Hours? Haven't seen it again?
Also, were there any other cmirrors present, or just the one you were creating/deactivating/deleting?
It took a few hours of running the command loop before the issue occurred, and I did have another mirror active but idle in the cluster.
Guessing from the assembly code, I would say that the cluster_log_serverd thread is dereferencing a log context that was *just* freed by the client. There is no protection for the log list, so this seems possible. Simple locking would fix this problem. However, I'd like to make sure that this is what is happening...
Add locking around the log list. There was a small window of opportunity for the log server to look up a log in the list while another entry was being deleted (bad for the server).
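The shape of the fix can be sketched in userspace C. This is a minimal illustration, not the actual dm-cmirror patch: the struct name, field names, and helper functions below are hypothetical, and a pthread mutex stands in for the kernel spinlock that would protect the list. The point is that both the server-side lookup and the client-side delete take the same lock, closing the window where the server could walk the list while an entry was being freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cmirror log context kept on a
 * singly linked list.  In the kernel this would be a list_head
 * protected by a spinlock; here a mutex plays that role. */
struct log_ctx {
	char uuid[64];
	struct log_ctx *next;
};

static struct log_ctx *log_list;  /* head of the log list */
static pthread_mutex_t log_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Client side: add a log context to the list under the lock. */
void add_log(const char *uuid)
{
	struct log_ctx *lc = calloc(1, sizeof(*lc));
	strncpy(lc->uuid, uuid, sizeof(lc->uuid) - 1);
	pthread_mutex_lock(&log_list_lock);
	lc->next = log_list;
	log_list = lc;
	pthread_mutex_unlock(&log_list_lock);
}

/* Server side: look up a log by uuid while holding the lock, so a
 * concurrent delete cannot unlink/free the entry mid-walk.  (The
 * unprotected version of this walk is what the panicking
 * cluster_log_serverd thread was doing.) */
struct log_ctx *get_log(const char *uuid)
{
	struct log_ctx *lc;
	pthread_mutex_lock(&log_list_lock);
	for (lc = log_list; lc; lc = lc->next)
		if (!strcmp(lc->uuid, uuid))
			break;
	pthread_mutex_unlock(&log_list_lock);
	return lc;
}

/* Client side: unlink and free an entry under the same lock. */
void remove_log(const char *uuid)
{
	struct log_ctx **pp, *lc;
	pthread_mutex_lock(&log_list_lock);
	for (pp = &log_list; (lc = *pp); pp = &lc->next) {
		if (!strcmp(lc->uuid, uuid)) {
			*pp = lc->next;
			free(lc);
			break;
		}
	}
	pthread_mutex_unlock(&log_list_lock);
}
```

Note that serializing the list walk against delete is only the first half of the story: if the server keeps using the returned pointer after dropping the lock, a reference count or holding the lock across the use would also be needed. The sketch only shows the list-protection part described in the fix.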
Marking this fix verified as I haven't been able to reproduce since the fix went in.
Assuming this VERIFIED fix got released. Closing. Reopen if it's not yet resolved.