Bug 131004
Summary: | cman_tool join/leave hangs after many iterations. | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Dean Jansa <djansa> |
Component: | gfs | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | GFS Bugs <gfs-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | ||
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-09-21 14:22:14 UTC | Type: | --- |
Description
Dean Jansa
2004-08-26 16:10:59 UTC
Annoyingly I can't reproduce this on my UP or SMP systems unless it needs many more iterations than I infer from "several". Can you tell me whether cman_tool hangs on join or leave (latest dmesg output would also be handy) and also which cman daemons are running at the point of the hang? Also, how many other machines are in the cluster at the time?

OK, I've managed to reproduce this now. I've added a missing wake call which might have been causing this. It's probably worth testing again for that; it seems OK here for a few hours now. cman really needs to be updated to use the new kthreads interface at some point.

Patrick -- Running another test I hit the following. I was doing leave/joins as well as GETMEMBERS ioctls when this happened. Looks like it may be related to the above issue? If not, let's open a new bug. 6 node cluster, taking 1 or 2 members out and then checking everyone sees the same cluster view. tank-03 did a join to rejoin the cluster and:

Unable to handle kernel NULL pointer dereference at virtual address 00000001
 printing eip:
00000001
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<00000001>]    Not tainted
EFLAGS: 00010087   (2.6.8.1)
EIP is at 0x1
eax: f5b83e50   ebx: f5b83e50   ecx: 00000000   edx: 00000001
esi: 00000000   edi: 8398d040   ebp: f5b83eec   esp: f5b83ecc
ds: 007b   es: 007b   ss: 0068
Process cman_comms (pid: 3873, threadinfo=f5b82000 task=f5c8f310)
Stack: c011eff7 00000000 f8a83684 00000001 00000001 00000282 f8a83680 00000014
       f5b83f04 c011f05f 00000000 00000000 00000000 f759fd80 f5f68800 f8a6701e
       00000000 f8a67afd f5c8f310 c2019ca0 c201a600 f759fd80 f5f68800 00000014
Call Trace:
 [<c011eff7>] __wake_up_common+0x37/0x70
 [<c011f05f>] __wake_up+0x2f/0x40
 [<f8a6701e>] unjam+0x1e/0x40 [cman]
 [<f8a67afd>] send_to_userport+0xad/0x560 [cman]
 [<f8a671bc>] receive_message+0xcc/0xf0 [cman]
 [<f8a67359>] cluster_kthread+0x179/0x320 [cman]
 [<c011efb0>] default_wake_function+0x0/0x10
 [<f8a671e0>] cluster_kthread+0x0/0x320 [cman]
 [<c01042b5>] kernel_thread_helper+0x5/0x10
Code:  Bad EIP value.

I hit another stack running the same test: single or multiple nodes leave the cluster, all other nodes check that they agree on the cluster view, the nodes rejoin, repeat. New stack (triggered after a cman_tool leave):

Unable to handle kernel NULL pointer dereference at virtual address 00000001
 printing eip:
00000001
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<00000001>]    Not tainted
EFLAGS: 00010086   (2.6.8.1)
EIP is at 0x1
eax: fffffff2   ebx: c011eff7   ecx: 00000000   edx: fffffff2
esi: 00000000   edi: f8a43684   ebp: 00000001   esp: f550ff60
ds: 007b   es: 007b   ss: 0068
Process cman_comms (pid: 2331, threadinfo=f550e000 task=c2362b70)
Stack: 00000286 f8a43680 f7102800 f550ff84 c011f05f 00000000 00000000 f727cc80
       f8a449c8 f550e000 f8a2701e 00000000 f8a2ab70 f8a39fc4 0136001e f727cc80
       f8a449c8 f7102800 f550e000 f8a27443 f8a3b8e0 c2362b70 0000001f 00000000
Call Trace:
 [<c011f05f>] __wake_up+0x2f/0x40
 [<f8a2701e>] unjam+0x1e/0x40 [cman]
 [<f8a2ab70>] node_shutdown+0x20/0x330 [cman]
 [<f8a27443>] cluster_kthread+0x263/0x320 [cman]
 [<c011efb0>] default_wake_function+0x0/0x10
 [<f8a271e0>] cluster_kthread+0x0/0x320 [cman]
 [<c01042b5>] kernel_thread_helper+0x5/0x10
Code:  Bad EIP value.

It's definitely a different bug. If this morning's checkin doesn't fix it then raise a new bug report.

I think this is worth a retest, bear in mind my previous comment...

I have not seen the hang after several hours of running, nor the above stacks. Did you make a change which would have gotten rid of the oops from comments 4 and 5?

I have, yes. Apologies for not making that clear.

OK -- In that case I'll mark this as fixed. Thanks!

Updating version to the right level in the defects. Sorry for the storm.
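For anyone comparing the two oopses quoted in this report, the following is a minimal sketch of how the frames could be lined up mechanically. It is not part of any cman tooling: the names `FRAME_RE`, `parse_frames` and `common_symbols` are hypothetical, and the frame lists are copied from the "Call Trace:" sections of the two dumps. The frames are taken as the kernel printed them, so an entry is not guaranteed to be a real caller.

```python
import re

# Matches one frame such as "[<f8a6701e>] unjam+0x1e/0x40 [cman]".
# The trailing "[module]" part is optional (core kernel symbols have none).
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)

# Frames from the oops seen when tank-03 rejoined the cluster.
FIRST_OOPS = """
[<c011eff7>] __wake_up_common+0x37/0x70
[<c011f05f>] __wake_up+0x2f/0x40
[<f8a6701e>] unjam+0x1e/0x40 [cman]
[<f8a67afd>] send_to_userport+0xad/0x560 [cman]
[<f8a671bc>] receive_message+0xcc/0xf0 [cman]
[<f8a67359>] cluster_kthread+0x179/0x320 [cman]
[<c011efb0>] default_wake_function+0x0/0x10
[<f8a671e0>] cluster_kthread+0x0/0x320 [cman]
[<c01042b5>] kernel_thread_helper+0x5/0x10
"""

# Frames from the oops triggered after a cman_tool leave.
SECOND_OOPS = """
[<c011f05f>] __wake_up+0x2f/0x40
[<f8a2701e>] unjam+0x1e/0x40 [cman]
[<f8a2ab70>] node_shutdown+0x20/0x330 [cman]
[<f8a27443>] cluster_kthread+0x263/0x320 [cman]
[<c011efb0>] default_wake_function+0x0/0x10
[<f8a271e0>] cluster_kthread+0x0/0x320 [cman]
[<c01042b5>] kernel_thread_helper+0x5/0x10
"""


def parse_frames(trace):
    """Return (symbol, module) pairs in printed order; module is None if absent."""
    return [(m.group(2), m.group(5)) for m in FRAME_RE.finditer(trace)]


def common_symbols(trace_a, trace_b):
    """Symbols present in both traces, in trace_a order, without duplicates."""
    in_b = {symbol for symbol, _ in parse_frames(trace_b)}
    shared = []
    for symbol, _ in parse_frames(trace_a):
        if symbol in in_b and symbol not in shared:
            shared.append(symbol)
    return shared


if __name__ == "__main__":
    for symbol in common_symbols(FIRST_OOPS, SECOND_OOPS):
        print(symbol)
```

Run on the two traces above, this lists the symbols they share, which include `unjam` and `__wake_up`. The entry points differ: `send_to_userport` in the first trace, `node_shutdown` in the second.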