131004 – cman_tool join/leave hangs after many iterations.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-08-26 16:10 UTC by Dean Jansa
Modified:	2010-01-12 02:57 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-09-21 14:22:14 UTC
Embargoed:

Description Dean Jansa 2004-08-26 16:10:59 UTC

Running: 
 
while true 
do 
  sleep 1 
  cman_tool leave 
  sleep 1 
  cman_tool join 
done 
 
After several iterations the above loop hangs, at which point 
all further attempts to run cman_tool fail with the error:  can't 
open cluster socket: Device or resource busy 
(which makes sense, as there is a hung cman_tool). 
 
[root@tank-06 root]# uname -ar 
Linux tank-06.lab.msp.redhat.com 2.6.8.1 #1 SMP Tue Aug 24 14:47:46 
CDT 2004 i686 i686 i386 GNU/Linux 
 
 
Version-Release number of selected component (if applicable): 
 
CMAN <CVS> (built Aug 24 2004 14:55:41) installed  
 
How reproducible: 
Always
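
To pin down how many iterations it takes and whether the hang is on the leave or the join, the loop above can be wrapped so that each step is watched for a time limit. This is only a minimal sketch: CMAN_TOOL, STEP_TIMEOUT, run_step and stress_loop are names made up for this example (they are not part of cman), and the 30-second limit is an arbitrary assumption. The step is polled rather than killed, so the script still reports even if the hung process cannot be signalled.

```shell
#!/bin/sh
# CMAN_TOOL and STEP_TIMEOUT are illustrative knobs for this sketch,
# not options that cman itself provides.
CMAN_TOOL=${CMAN_TOOL:-cman_tool}
STEP_TIMEOUT=${STEP_TIMEOUT:-30}

# run_step ITERATION SUBCOMMAND
# Start the subcommand in the background and poll it once a second.
# Returns 0 if it exits within STEP_TIMEOUT seconds, 1 if it is still
# running after that. The exit status of the subcommand is ignored,
# as it is in the original loop.
run_step() {
    "$CMAN_TOOL" "$2" &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$STEP_TIMEOUT" ]; then
            echo "iteration $1: '$2' still running after ${STEP_TIMEOUT}s (pid $pid)"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Repeat the leave/join cycle from the report and stop at the first
# step that does not finish in time.
stress_loop() {
    i=1
    while true; do
        sleep 1
        run_step "$i" leave || return 1
        sleep 1
        run_step "$i" join || return 1
        i=$((i + 1))
    done
}
```

Calling stress_loop at the end of the script starts the run; the last line printed names the iteration and the step that hung.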

Comment 1 Christine Caulfield 2004-09-06 10:20:07 UTC

Annoyingly I can't reproduce this on my UP or SMP systems unless it
needs many more iterations than I infer from "several".

Can you tell me whether cman_tool hangs on join or leave (latest dmesg
output would also be handy) and also which cman daemons are running at
the point of the hang ?

Also, how many other machines are in the cluster at the time ?

Comment 2 Christine Caulfield 2004-09-09 10:32:51 UTC

OK, I've managed to reproduce this now.

Comment 3 Christine Caulfield 2004-09-10 11:53:15 UTC

I've added a missing wake call which might have been causing this.
It's probably worth testing again for that; it seems OK here for a few
hours now.

cman really needs to be updated to use the new kthreads interface at
some point.

Comment 4 Dean Jansa 2004-09-13 19:38:12 UTC

Patrick -- 
 
Running another test I hit the following.  I was doing leave/joins 
as well as GETMEMBERS ioctls when this happened.  Looks like it may 
be related to the above issue?  If not, let's open a new bug. 
 
6 node cluster, taking 1 or 2 members out and then checking everyone 
sees the same cluster view.  tank-03 did a join to rejoin the 
cluster and: 
 
 Unable to handle kernel NULL pointer dereference at virtual address 
00000001 
 printing eip: 
00000001 
*pde = 00000000 
Oops: 0000 [#1] 
SMP 
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs 
lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy 
sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac 
ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod 
CPU:    0 
EIP:    0060:[<00000001>]    Not tainted 
EFLAGS: 00010087   (2.6.8.1) 
EIP is at 0x1 
eax: f5b83e50   ebx: f5b83e50   ecx: 00000000   edx: 00000001 
esi: 00000000   edi: 8398d040   ebp: f5b83eec   esp: f5b83ecc 
ds: 007b   es: 007b   ss: 0068 
Process cman_comms (pid: 3873, threadinfo=f5b82000 task=f5c8f310) 
Stack: c011eff7 00000000 f8a83684 00000001 00000001 00000282 
f8a83680 00000014 
       f5b83f04 c011f05f 00000000 00000000 00000000 f759fd80 
f5f68800 f8a6701e 
       00000000 f8a67afd f5c8f310 c2019ca0 c201a600 f759fd80 
f5f68800 00000014 
Call Trace: 
 [<c011eff7>] __wake_up_common+0x37/0x70 
 [<c011f05f>] __wake_up+0x2f/0x40 
 [<f8a6701e>] unjam+0x1e/0x40 [cman] 
 [<f8a67afd>] send_to_userport+0xad/0x560 [cman] 
 [<f8a671bc>] receive_message+0xcc/0xf0 [cman] 
 [<f8a67359>] cluster_kthread+0x179/0x320 [cman] 
 [<c011efb0>] default_wake_function+0x0/0x10 
 [<f8a671e0>] cluster_kthread+0x0/0x320 [cman] 
 [<c01042b5>] kernel_thread_helper+0x5/0x10 
Code:  Bad EIP value.

Comment 5 Dean Jansa 2004-09-13 21:13:11 UTC

I hit another stack running the same thing: single or multiple nodes 
leave the cluster, all other nodes check that they agree on the cluster 
view, rejoin, repeat.  New stack (triggered after a cman_tool leave): 
 
Unable to handle kernel NULL pointer dereference at virtual address 
00000001 
 printing eip: 
00000001 
*pde = 00000000 
Oops: 0000 [#1] 
SMP 
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc 
e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery 
asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod 
scsi_mod 
CPU:    0 
EIP:    0060:[<00000001>]    Not tainted 
EFLAGS: 00010086   (2.6.8.1) 
EIP is at 0x1 
eax: fffffff2   ebx: c011eff7   ecx: 00000000   edx: fffffff2 
esi: 00000000   edi: f8a43684   ebp: 00000001   esp: f550ff60 
ds: 007b   es: 007b   ss: 0068 
Process cman_comms (pid: 2331, threadinfo=f550e000 task=c2362b70) 
Stack: 00000286 f8a43680 f7102800 f550ff84 c011f05f 00000000 
00000000 f727cc80 
       f8a449c8 f550e000 f8a2701e 00000000 f8a2ab70 f8a39fc4 
0136001e f727cc80 
       f8a449c8 f7102800 f550e000 f8a27443 f8a3b8e0 c2362b70 
0000001f 00000000 
Call Trace: 
 [<c011f05f>] __wake_up+0x2f/0x40 
 [<f8a2701e>] unjam+0x1e/0x40 [cman] 
 [<f8a2ab70>] node_shutdown+0x20/0x330 [cman] 
 [<f8a27443>] cluster_kthread+0x263/0x320 [cman] 
 [<c011efb0>] default_wake_function+0x0/0x10 
 [<f8a271e0>] cluster_kthread+0x0/0x320 [cman] 
 [<c01042b5>] kernel_thread_helper+0x5/0x10 
Code:  Bad EIP value.

Comment 6 Christine Caulfield 2004-09-14 09:25:17 UTC

It's definitely a different bug. If this morning's checkin doesn't fix
it then raise a new bug report.

Comment 7 Christine Caulfield 2004-09-20 15:09:57 UTC

I think this is worth a retest, bear in mind my previous comment...

Comment 8 Dean Jansa 2004-09-20 21:01:16 UTC

I have not seen the hang after several hours of running, nor 
the above stacks.  Did you make a change which would have gotten rid 
of the oops from comments 4 and 5?

Comment 9 Christine Caulfield 2004-09-21 06:51:50 UTC

I have, yes. 

Apologies for not making that clear.

Comment 10 Dean Jansa 2004-09-21 14:22:14 UTC

OK -- In that case I'll mark this as fixed.  Thanks!

Comment 11 Kiersten (Kerri) Anderson 2004-11-16 19:08:28 UTC

Updating version to the right level in the defects.  Sorry for the storm.
