Bug 131004

Summary:	cman_tool join/leave hangs after many iterations.
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Dean Jansa <djansa>
Component:	gfs	Assignee:	Christine Caulfield <ccaulfie>
Status:	CLOSED CURRENTRELEASE	QA Contact:	GFS Bugs <gfs-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-09-21 14:22:14 UTC	Type:	---

Description Dean Jansa 2004-08-26 16:10:59 UTC

Running: 
 
while true 
do 
  sleep 1 
  cman_tool leave 
  sleep 1 
  cman_tool join 
done 
 
After several iterations the above loop hangs, at which point 
all further attempts to run cman_tool fail with the error: can't 
open cluster socket: Device or resource busy. 
(This makes sense, as there is a hung cman_tool.) 
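
For automated runs, the loop above can be bounded and given a time limit so that a hang is reported instead of blocking forever. The following is only a sketch, not part of the original report: the function names `step` and `repro_loop`, the variables `CMAN_TOOL`, `LIMIT` and `PAUSE`, and the 30-second limit are all assumptions, and it relies on timeout(1) from GNU coreutils being available. If the hung process is stuck in uninterruptible kernel sleep, timeout itself may not return either.

```shell
#!/bin/sh
# Sketch only: all names and defaults below are illustrative assumptions.
CMAN_TOOL=${CMAN_TOOL:-cman_tool}   # command under test
LIMIT=${LIMIT:-30}                  # seconds before a call is treated as hung
PAUSE=${PAUSE:-1}                   # sleep between calls, as in the original loop

# Run one command under a time limit; return 124 if it did not finish in time.
step() {
    rc=0
    timeout "$LIMIT" "$@" || rc=$?
    # timeout(1) exits with status 124 when the limit expires
    if [ "$rc" -eq 124 ]; then
        echo "iteration $i: '$*' did not exit within ${LIMIT}s" >&2
        return 124
    fi
    # Other non-zero exits are ignored, as they are in the original loop
    return 0
}

# Repeat leave/join up to $1 times, stopping at the first hang.
repro_loop() {
    max=$1
    i=1
    while [ "$i" -le "$max" ]; do
        sleep "$PAUSE"
        step "$CMAN_TOOL" leave || return 1
        sleep "$PAUSE"
        step "$CMAN_TOOL" join || return 1
        i=$((i + 1))
    done
    echo "completed $max iterations without a hang"
}
```

Sourcing the file and running `repro_loop 500` would stop at the first leave or join that exceeds the limit and print the iteration number, which also answers whether the hang is on join or on leave.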
 
[root@tank-06 root]# uname -ar 
Linux tank-06.lab.msp.redhat.com 2.6.8.1 #1 SMP Tue Aug 24 14:47:46 CDT 2004 i686 i686 i386 GNU/Linux 
 
 
Version-Release number of selected component (if applicable): 
 
CMAN <CVS> (built Aug 24 2004 14:55:41) installed  
 
How reproducible: 
Always

Comment 1 Christine Caulfield 2004-09-06 10:20:07 UTC

Annoyingly I can't reproduce this on my UP or SMP systems unless it
needs many more iterations than I infer from "several".

Can you tell me whether cman_tool hangs on join or leave (latest dmesg
output would also be handy) and also which cman daemons are running at
the point of the hang ?

Also, how many other machines are in the cluster at the time ?

Comment 2 Christine Caulfield 2004-09-09 10:32:51 UTC

OK, I've managed to reproduce this now.

Comment 3 Christine Caulfield 2004-09-10 11:53:15 UTC

I've added a missing wake call which might have been causing this.
It's probably worth testing again for that; it has seemed OK here for a
few hours now.

cman really needs to be updated to use the new kthreads interface at
some point.

Comment 4 Dean Jansa 2004-09-13 19:38:12 UTC

Patrick -- 
 
Running another test I hit the following.  I was doing leave/joins 
as well as GETMEMBERS ioctls when this happened.  Looks like it may 
be related to the above issue?  If not, let's open a new bug. 
 
6 node cluster, taking 1 or 2 members out and then checking everyone 
sees the same cluster view.  tank-03 did a join to rejoin the 
cluster and: 
 
Unable to handle kernel NULL pointer dereference at virtual address 00000001 
 printing eip: 
00000001 
*pde = 00000000 
Oops: 0000 [#1] 
SMP 
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs 
lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy 
sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac 
ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod 
CPU:    0 
EIP:    0060:[<00000001>]    Not tainted 
EFLAGS: 00010087   (2.6.8.1) 
EIP is at 0x1 
eax: f5b83e50   ebx: f5b83e50   ecx: 00000000   edx: 00000001 
esi: 00000000   edi: 8398d040   ebp: f5b83eec   esp: f5b83ecc 
ds: 007b   es: 007b   ss: 0068 
Process cman_comms (pid: 3873, threadinfo=f5b82000 task=f5c8f310) 
Stack: c011eff7 00000000 f8a83684 00000001 00000001 00000282 f8a83680 00000014 
       f5b83f04 c011f05f 00000000 00000000 00000000 f759fd80 f5f68800 f8a6701e 
       00000000 f8a67afd f5c8f310 c2019ca0 c201a600 f759fd80 f5f68800 00000014 
Call Trace: 
 [<c011eff7>] __wake_up_common+0x37/0x70 
 [<c011f05f>] __wake_up+0x2f/0x40 
 [<f8a6701e>] unjam+0x1e/0x40 [cman] 
 [<f8a67afd>] send_to_userport+0xad/0x560 [cman] 
 [<f8a671bc>] receive_message+0xcc/0xf0 [cman] 
 [<f8a67359>] cluster_kthread+0x179/0x320 [cman] 
 [<c011efb0>] default_wake_function+0x0/0x10 
 [<f8a671e0>] cluster_kthread+0x0/0x320 [cman] 
 [<c01042b5>] kernel_thread_helper+0x5/0x10 
Code:  Bad EIP value.
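
The "everyone sees the same cluster view" check mentioned in this comment can be approximated by comparing per-node member lists. The following is a hypothetical sketch and not the test harness actually used: the function name `views_agree` and the input format (one file per node, one member name per line, collected by whatever means) are assumptions.

```shell
#!/bin/sh
# Sketch only: views_agree and its input format are illustrative assumptions.
# Each argument is a file holding one node's view of the cluster membership,
# one member name per line.
views_agree() {
    ref=$1
    shift
    for f in "$@"; do
        # Compare sorted lists so that ordering differences are not
        # reported as a disagreement between nodes
        if ! [ "$(sort "$ref")" = "$(sort "$f")" ]; then
            echo "view mismatch: $ref vs $f" >&2
            return 1
        fi
    done
    return 0
}
```

Called as `views_agree view.tank-01 view.tank-02 view.tank-03` (file names are placeholders), it returns non-zero as soon as one node's list differs from the first.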

Comment 5 Dean Jansa 2004-09-13 21:13:11 UTC

I hit another stack running the same test: single or multiple nodes 
leave the cluster, all other nodes check that they agree on the 
cluster view, the nodes rejoin, and the cycle repeats.  New stack 
(triggered after a cman_tool leave): 
 
Unable to handle kernel NULL pointer dereference at virtual address 00000001 
 printing eip: 
00000001 
*pde = 00000000 
Oops: 0000 [#1] 
SMP 
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc 
e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery 
asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod 
scsi_mod 
CPU:    0 
EIP:    0060:[<00000001>]    Not tainted 
EFLAGS: 00010086   (2.6.8.1) 
EIP is at 0x1 
eax: fffffff2   ebx: c011eff7   ecx: 00000000   edx: fffffff2 
esi: 00000000   edi: f8a43684   ebp: 00000001   esp: f550ff60 
ds: 007b   es: 007b   ss: 0068 
Process cman_comms (pid: 2331, threadinfo=f550e000 task=c2362b70) 
Stack: 00000286 f8a43680 f7102800 f550ff84 c011f05f 00000000 00000000 f727cc80 
       f8a449c8 f550e000 f8a2701e 00000000 f8a2ab70 f8a39fc4 0136001e f727cc80 
       f8a449c8 f7102800 f550e000 f8a27443 f8a3b8e0 c2362b70 0000001f 00000000 
Call Trace: 
 [<c011f05f>] __wake_up+0x2f/0x40 
 [<f8a2701e>] unjam+0x1e/0x40 [cman] 
 [<f8a2ab70>] node_shutdown+0x20/0x330 [cman] 
 [<f8a27443>] cluster_kthread+0x263/0x320 [cman] 
 [<c011efb0>] default_wake_function+0x0/0x10 
 [<f8a271e0>] cluster_kthread+0x0/0x320 [cman] 
 [<c01042b5>] kernel_thread_helper+0x5/0x10 
Code:  Bad EIP value.

Comment 6 Christine Caulfield 2004-09-14 09:25:17 UTC

It's definitely a different bug. If this morning's checkin doesn't fix
it then raise a new bug report.

Comment 7 Christine Caulfield 2004-09-20 15:09:57 UTC

I think this is worth a retest; bear in mind my previous comment.

Comment 8 Dean Jansa 2004-09-20 21:01:16 UTC

I have not seen the hang after several hours of running, nor 
the above stacks.  Did you make a change which would have gotten rid 
of the oops from comments 4 and 5?

Comment 9 Christine Caulfield 2004-09-21 06:51:50 UTC

I have, yes.

Apologies for not making that clear.

Comment 10 Dean Jansa 2004-09-21 14:22:14 UTC

OK -- In that case I'll mark this as fixed.  Thanks!

Comment 11 Kiersten (Kerri) Anderson 2004-11-16 19:08:28 UTC

Updating version to the right level in the defects.  Sorry for the storm.