Description of problem:
Was starting/stopping my 2-node cluster and got a kernel Oops on both nodes when clvmd was started.

1. ccsd
2. cman_tool join (attain quorum)
3. fence_tool join
4. clvmd
5. reverse steps
6. repeat

Got the Oops on step 4 after 7 iterations.

FIRST NODE:
===========
Unable to handle kernel paging request at virtual address e0365798
printing eip:
c01aa741
*pde = 014ed067
Oops: 0002 [#1]
Modules linked in: loop dlm cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c01aa741>] Not tainted
EFLAGS: 00010282 (2.6.7)
EIP is at kobject_add+0xb1/0xf0
eax: c0328008 ebx: c0328000 ecx: e0365798 edx: de1eca88
esi: c0328048 edi: de1eca6c ebp: df76c660 esp: d6b19ea4
ds: 007b es: 007b ss: 0068
Process clvmd (pid: 16940, threadinfo=d6b18000 task=d6f0c830)
Stack: de1eca6c de1eca64 df76c64c df76c638 c01ff9ce de1eca6c de1ecaac c01aa5e4
       de1eca64 de1eca58 fffffff4 de1eca64 df76c638 c01fffbf d6b19f08 00000000
       00000005 dd917364 c0320ba8 ffffffff c01e1794 df76c638 00a0003d 00000000
Call Trace:
 [<c01ff9ce>] class_device_add+0x5e/0x120
 [<c01aa5e4>] kobject_init+0x24/0x40
 [<c01fffbf>] class_simple_device_add+0xaf/0xe0
 [<c01e1794>] misc_register+0xc4/0x180
 [<e04bf3ba>] register_lockspace+0x11a/0x190 [dlm]
 [<e04bfad3>] dlm_ctl_ioctl+0x63/0xc0 [dlm]
 [<c014c62f>] filp_open+0x4f/0x60
 [<c015cf7f>] sys_ioctl+0xcf/0x210
 [<c0105cad>] sysenter_past_esp+0x52/0x71

Code: 89 11 89 4a 04 8b 47 28 8b 18 8d 4b 48 89 c8 ba ff ff 00 00

Aug 4 16:58:13 link-10 kernel: dlm: clvmd: recover event 3 done
Aug 4 16:58:13 link-10 kernel: dlm: clvmd: recover event 3 finished
Aug 4 16:58:14 link-10 kernel: Unable to handle kernel paging request at virtual address e0365798
Aug 4 16:58:14 link-10 kernel: printing eip:
Aug 4 16:58:14 link-10 kernel: c01aa741
Aug 4 16:58:14 link-10 kernel: *pde = 014ed067
Aug 4 16:58:14 link-10 kernel: Oops: 0002 [#1]
Aug 4 16:58:14 link-10 kernel: Modules linked in: loop dlm cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Aug 4 16:58:14 link-10 kernel: CPU: 0
Aug 4 16:58:14 link-10 kernel: EIP: 0060:[<c01aa741>] Not tainted
Aug 4 16:58:14 link-10 kernel: EFLAGS: 00010282 (2.6.7)
Aug 4 16:58:14 link-10 kernel: EIP is at kobject_add+0xb1/0xf0
Aug 4 16:58:14 link-10 kernel: eax: c0328008 ebx: c0328000 ecx: e0365798 edx: de1eca88
Aug 4 16:58:14 link-10 kernel: esi: c0328048 edi: de1eca6c ebp: df76c660 esp: d6b19ea4
Aug 4 16:58:14 link-10 kernel: ds: 007b es: 007b ss: 0068
Aug 4 16:58:14 link-10 kernel: Process clvmd (pid: 16940, threadinfo=d6b18000 task=d6f0c830)
Aug 4 16:58:14 link-10 kernel: Stack: de1eca6c de1eca64 df76c64c df76c638 c01ff9ce de1eca6c de1ecaac c01aa5e4
Aug 4 16:58:14 link-10 kernel: de1eca64 de1eca58 fffffff4 de1eca64 df76c638 c01fffbf d6b19f08 00000000
Aug 4 16:58:14 link-10 kernel: 00000005 dd917364 c0320ba8 ffffffff c01e1794 df76c638 00a0003d 00000000
Aug 4 16:58:14 link-10 kernel: Call Trace:
Aug 4 16:58:14 link-10 kernel: [<c01ff9ce>] class_device_add+0x5e/0x120
Aug 4 16:58:14 link-10 kernel: [<c01aa5e4>] kobject_init+0x24/0x40
Aug 4 16:58:14 link-10 kernel: [<c01fffbf>] class_simple_device_add+0xaf/0xe0
Aug 4 16:58:14 link-10 kernel: [<c01e1794>] misc_register+0xc4/0x180
Aug 4 16:58:14 link-10 kernel: [<e04bf3ba>] register_lockspace+0x11a/0x190 [dlm]
Aug 4 16:58:14 link-10 kernel: [<e04bfad3>] dlm_ctl_ioctl+0x63/0xc0 [dlm]
Aug 4 16:58:14 link-10 kernel: [<c014c62f>] filp_open+0x4f/0x60
Aug 4 16:58:14 link-10 kernel: [<c015cf7f>] sys_ioctl+0xcf/0x210
Aug 4 16:58:14 link-10 kernel: [<c0105cad>] sysenter_past_esp+0x52/0x71
Aug 4 16:58:14 link-10 kernel:
Aug 4 16:58:14 link-10 kernel: Code: 89 11 89 4a 04 8b 47 28 8b 18 8d 4b 48 89 c8 ba ff ff 00 00

SECOND NODE:
============
Unable to handle kernel paging request at virtual address e0365798
printing eip:
c01aa741
*pde = 014ed067
Oops: 0002 [#1]
Modules linked in: loop dlm cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c01aa741>] Not tainted
EFLAGS: 00010282 (2.6.7)
EIP is at kobject_add+0xb1/0xf0
eax: c0328008 ebx: c0328000 ecx: e0365798 edx: dccf8208
esi: c0328048 edi: dccf81ec ebp: df76c660 esp: d7229ea4
ds: 007b es: 007b ss: 0068
Process clvmd (pid: 15649, threadinfo=d7228000 task=da5f8b30)
Stack: dccf81ec dccf81e4 df76c64c df76c638 c01ff9ce dccf81ec dccf822c c01aa5e4
       dccf81e4 dccf81d8 fffffff4 dccf81e4 df76c638 c01fffbf d7229f08 00000000
       00000005 dccf8764 c0320ba8 ffffffff c01e1794 df76c638 00a0003d 00000000
Call Trace:
 [<c01ff9ce>] class_device_add+0x5e/0x120
 [<c01aa5e4>] kobject_init+0x24/0x40
 [<c01fffbf>] class_simple_device_add+0xaf/0xe0
 [<c01e1794>] misc_register+0xc4/0x180
 [<e04bf3ba>] register_lockspace+0x11a/0x190 [dlm]
 [<e04bfad3>] dlm_ctl_ioctl+0x63/0xc0 [dlm]
 [<e01f0064>] e1000_xmit_frame+0x4f4/0x7a0 [e1000]
 [<c0273e7c>] net_rx_action+0x6c/0xf0
 [<c011e809>] __do_softirq+0x79/0x80
 [<c015cf7f>] sys_ioctl+0xcf/0x210
 [<c0105cad>] sysenter_past_esp+0x52/0x71

Code: 89 11 89 4a 04 8b 47 28 8b 18 8d 4b 48 89 c8 ba ff ff 00 00

Aug 4 16:54:08 link-11 kernel: dlm: clvmd: total nodes 1
Aug 4 16:54:08 link-11 kernel: dlm: clvmd: rebuild resource directory
Aug 4 16:54:08 link-11 kernel: dlm: clvmd: rebuilt 0 resources
Aug 4 16:54:08 link-11 kernel: dlm: clvmd: recover event 2 done
Aug 4 16:54:08 link-11 kernel: dlm: clvmd: recover event 2 finished
Aug 4 16:54:08 link-11 kernel: Unable to handle kernel paging request at virtual address e0365798
Aug 4 16:54:08 link-11 kernel: printing eip:
Aug 4 16:54:08 link-11 kernel: c01aa741
Aug 4 16:54:08 link-11 kernel: *pde = 014ed067
Aug 4 16:54:08 link-11 kernel: Oops: 0002 [#1]
Aug 4 16:54:08 link-11 kernel: Modules linked in: loop dlm cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Aug 4 16:54:08 link-11 kernel: CPU: 0
Aug 4 16:54:08 link-11 kernel: EIP: 0060:[<c01aa741>] Not tainted
Aug 4 16:54:08 link-11 kernel: EFLAGS: 00010282 (2.6.7)
Aug 4 16:54:08 link-11 kernel: EIP is at kobject_add+0xb1/0xf0
Aug 4 16:54:08 link-11 kernel: eax: c0328008 ebx: c0328000 ecx: e0365798 edx: dccf8208
Aug 4 16:54:08 link-11 kernel: esi: c0328048 edi: dccf81ec ebp: df76c660 esp: d7229ea4
Aug 4 16:54:08 link-11 kernel: ds: 007b es: 007b ss: 0068
Aug 4 16:54:08 link-11 kernel: Process clvmd (pid: 15649, threadinfo=d7228000 task=da5f8b30)
Aug 4 16:54:08 link-11 kernel: Stack: dccf81ec dccf81e4 df76c64c df76c638 c01ff9ce dccf81ec dccf822c c01aa5e4
Aug 4 16:54:08 link-11 kernel: dccf81e4 dccf81d8 fffffff4 dccf81e4 df76c638 c01fffbf d7229f08 00000000
Aug 4 16:54:08 link-11 kernel: 00000005 dccf8764 c0320ba8 ffffffff c01e1794 df76c638 00a0003d 00000000
Aug 4 16:54:08 link-11 kernel: Call Trace:
Aug 4 16:54:08 link-11 kernel: [<c01ff9ce>] class_device_add+0x5e/0x120
Aug 4 16:54:08 link-11 kernel: [<c01aa5e4>] kobject_init+0x24/0x40
Aug 4 16:54:08 link-11 kernel: [<c01fffbf>] class_simple_device_add+0xaf/0xe0
Aug 4 16:54:08 link-11 kernel: [<c01e1794>] misc_register+0xc4/0x180
Aug 4 16:54:08 link-11 kernel: [<e04bf3ba>] register_lockspace+0x11a/0x190 [dlm]
Aug 4 16:54:08 link-11 kernel: [<e04bfad3>] dlm_ctl_ioctl+0x63/0xc0 [dlm]
Aug 4 16:54:08 link-11 kernel: [<e01f0064>] e1000_xmit_frame+0x4f4/0x7a0 [e1000]
Aug 4 16:54:08 link-11 kernel: [<c0273e7c>] net_rx_action+0x6c/0xf0
Aug 4 16:54:08 link-11 kernel: [<c011e809>] __do_softirq+0x79/0x80
Aug 4 16:54:08 link-11 kernel: [<c015cf7f>] sys_ioctl+0xcf/0x210
Aug 4 16:54:08 link-11 kernel: [<c0105cad>] sysenter_past_esp+0x52/0x71
Aug 4 16:54:08 link-11 kernel:
Aug 4 16:54:08 link-11 kernel: Code: 89 11 89 4a 04 8b 47 28 8b 18 8d 4b 48 89 c8 ba ff ff 00 00

Version-Release number of selected component (if applicable):
[root@link-11 root]# clvmd -V
Cluster LVM Daemon version 0.2.1

How reproducible:
Haven't tried yet.

Steps to Reproduce:
1. Listed above.
2.
3.

Actual results:

Expected results:

Additional info:
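For readers unfamiliar with the "Oops: 0002" line in the traces above: on i386 the number is the page-fault error code in hex, and its low three bits give the kind of access that faulted. The helper below is a minimal sketch for decoding it; the function name decode_oops_code is made up for illustration and is not part of any cluster or kernel tool.

```shell
#!/bin/sh
# decode_oops_code is a hypothetical helper, not an existing tool.
# It interprets the low three bits of an i386 page-fault error code:
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
decode_oops_code() {
    code=$((0x$1))    # the Oops line prints the code in hex, e.g. 0002
    if [ $((code & 1)) -ne 0 ]; then
        cause="protection violation"
    else
        cause="page not present"
    fi
    if [ $((code & 2)) -ne 0 ]; then
        access="write"
    else
        access="read"
    fi
    if [ $((code & 4)) -ne 0 ]; then
        mode="user"
    else
        mode="kernel"
    fi
    echo "$access in $mode mode, $cause"
}

decode_oops_code 0002
```

For the code in this report it prints "write in kernel mode, page not present", which fits the register dump: the faulting address e0365798 is the value held in ecx on both nodes.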
Those symptoms are very strange. Do you have a script I can use to try to reproduce this?
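The start/stop sequence listed in the description could be driven by a loop along these lines. This is only a sketch, not the reporter's script. The four start commands are taken from the report. The teardown commands are assumptions, since the report only says "reverse steps". The DRY_RUN and ITERATIONS variables and the function names are made up for illustration. No quorum check is included, because the report does not say how quorum was confirmed.

```shell
#!/bin/sh
# Sketch of the start/stop cycle from the report. With DRY_RUN=1 (the
# default) each command is only printed; set DRY_RUN=0 on a disposable
# test node to actually run them.
DRY_RUN="${DRY_RUN:-1}"
ITERATIONS="${ITERATIONS:-10}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

start_cluster() {
    run ccsd
    run cman_tool join     # the report waits for quorum here (not shown)
    run fence_tool join
    run clvmd              # the Oops was reported at this step
}

stop_cluster() {
    # Assumed teardown; the report does not list the exact commands.
    run killall clvmd
    run fence_tool leave
    run cman_tool leave
    run killall ccsd
}

# Run the start/stop cycle the given number of times.
cycle() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "=== iteration $i ==="
        start_cluster || return 1
        stop_cluster || return 1
        i=$((i + 1))
    done
}

cycle "$ITERATIONS"
```

Running it as is only prints the command sequence. The report saw the Oops after 7 iterations, so a real run would presumably need at least that many on both nodes.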
OK, I'm fairly convinced that this is a symptom of some other bug. I've seen a pool corruption reported by a debug kernel when doing a cman_tool leave so this would be a likely culprit.
Updating version to the right level in the defects. Sorry for the storm.
I've not seen this and I suspect it has been fixed by other checkins. Punt it back if you see it again or have something I can reproduce it with.
Well, it doesn't Oops anymore. It does _hang_ in the state below, but that's another bug, I guess.

DLM Lock Space: "clvmd" 4 4 update U-4,1,12
[8 11 10 12]