Description of problem:
I started `cman_tool join` on all the nodes in the cluster (I have 8 nodes).
My environment is set up to start cman on startup. I rebooted all my nodes at
about the same time. A few of the nodes were unable to join the cluster (I've
not yet been able to figure out how to get any useful diagnostics...
suggestions are welcome). While trying to figure out why one of the nodes was
not joining, I ran `cat /proc/cluster/status` and it segfaulted due to a
kernel oops. Is this related to bug #142853?

Output from console:

CMAN: node trin-09 is not responding - removing from the cluster
CMAN: node trin-09 is not responding - removing from the cluster
CMAN: node trin-04 is not responding - removing from the cluster
Unable to handle kernel paging request at virtual address 0000dd86
 printing eip:
c01d821c
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core cman(U) md5 ipv6 sunrpc dm_mod button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c01d821c>]    Not tainted VLI
EFLAGS: 00010297   (2.6.9-1.906_EL)
EIP is at vsnprintf+0x2c7/0x488
eax: 0000dd86   ebx: de2b3e64   ecx: 0000dd86   edx: fffffffe
esi: de2b3de4   edi: 0000000a   ebp: ffffffff   esp: de2b3d9c
ds: 007b   es: 007b   ss: 0068
Process cat (pid: 8855, threadinfo=de2b3000 task=dbe7d830)
Stack: ffffffff ffffffff 00000000 ffffffff 21d4c1b8 de2b3e48 e0307815 de2b3e48
       00000000 00000000 00000018 c01d843f de2b3de0 c01d8452 e02ff854 de2b3e48
       e03077f8 0000dd86 ffffffff 00000002 ffffffff 242c4000 dbd3c000 e030797e
Call Trace:
 [<c01d843f>] vsprintf+0xd/0xf
 [<c01d8452>] sprintf+0x11/0x12
 [<e02ff854>] membership_state+0x93/0x9f [cman]
 [<e02ffa73>] proc_cluster_status+0x33/0x2a0 [cman]
 [<c019c3e1>] proc_alloc_inode+0x3c/0x54
 [<c017e6da>] alloc_inode+0xf6/0x17f
 [<c017c9ef>] d_instantiate+0x12e/0x131
 [<c019fd82>] proc_lookup+0x1a0/0x1aa
 [<c01710a1>] real_lookup+0x73/0xde
 [<c01713d1>] do_lookup+0x56/0x8f
 [<c017ae23>] dput+0x33/0x417
 [<c01720de>] link_path_walk+0xcd4/0xd8c
 [<c016d0f9>] cp_new_stat64+0x124/0x139
 [<c0145c9f>] buffered_rmqueue+0x1c4/0x1e7
 [<c0145d76>] __alloc_pages+0xb4/0x298
 [<c019f585>] proc_file_read+0x97/0x225
 [<c01621fe>] vfs_read+0xb6/0xe2
 [<c0162411>] sys_read+0x3c/0x62
 [<c0301bfb>] syscall_call+0x7/0xb
Code: 01 00 00 3b 5c 24 0c 77 f0 c6 03 20 eb eb 89 f0 83 c6 04 8b 08 b8 0f a1 31 c0 8b 54 24 04 81 f9 ff 0f 00 00 0f 46 c8 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 f6 44 24 08 10 89 c7
<4>CMAN: node trin-09 is not responding - removing from the cluster
CMAN: nmembers in HELLO message from 5 does not match our view (got 4, exp 5)
CMAN: too many transition restarts - will die
CMAN: we are leaving the cluster. Reason is 5

Version-Release number of selected component (if applicable):
cman-kernel-2.6.9-3.3
cman-1.0-0.pre5.0

How reproducible:
I've not really tried yet.

Steps to Reproduce:
I've not tried to reproduce it yet.
proc_cluster_status allocates a 255-byte buffer - that might be enough to kill the stack under some circumstances, so I've fixed that. Looking at that code it's hard to see what else it could be.

Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.10; previous revision: 1.9
done
As this has been in CVS (now git) since 2004, I think there's a good chance it's in a release!