Description of problem:
Seven-node cluster (link-01,2,3,4,5,7,8). Two nodes were shot (link-05 and link-08); when they came back up and attempted to cman join, link-08 panicked:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffffa0216b14>{:cman:proc_cluster_status+508}
PML4 1e43f067 PGD 1ea62067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 6102, comm: cat Tainted: G M 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa0216b14>] <ffffffffa0216b14>{:cman:proc_cluster_status+508}
RSP: 0018:000001001d59be78 EFLAGS: 00010202
RAX: 000000000000010f RBX: ffffffffa022c2e0 RCX: ffffffffa021eb9e
RDX: 000001003c01b400 RSI: 0000000000000004 RDI: 203a736573736572
RBP: 000000000000011f R08: 6464612065646f4e R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 000001001cb61000 R14: 0000000000000005 R15: 0000000000000007
FS: 0000002a95574b00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process cat (pid: 6102, threadinfo 000001001d59a000, task 000001003bd917f0)
Stack: 0000000000000004 0000000000000000 00000100397c78c0 000001001cb61000
       0000000000000400 0000000000000400 000001001d59bf50 ffffffff801ab147
       000001003cd585c0 0000000000000000
Call Trace: <ffffffff801ab147>{proc_file_read+201} <ffffffff80177a83>{vfs_read+207}
            <ffffffff80177cda>{sys_read+69} <ffffffff801101c6>{system_call+126}
Code: 49 8b 04 24 0f 18 08 48 83 c2 18 49 39 d4 0f 84 1a 01 00 00
RIP <ffffffffa0216b14>{:cman:proc_cluster_status+508} RSP <000001001d59be78>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Oops

From another node still up in the cluster:

[root@link-01 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    7   M   link-01
   2    1    7   M   link-03
   3    1    7   M   link-07
   4    1    7   X   link-08
   5    1    7   M   link-04
   6    1    7   X   link-05
   7    1    7   M   link-02

[root@link-01 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 5 7 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 1 5 7 3]

DLM Lock Space:  "link0"                             5   4 run       -
[2 1 5 7 3]

DLM Lock Space:  "link1"                             7   6 run       -
[2 1 5 7 3]

GFS Mount Group: "link0"                             6   5 run       -
[2 1 5 7 3]

GFS Mount Group: "link1"                             8   7 run       -
[2 1 5 7 3]

Version-Release number of selected component (if applicable):
Kernel 2.6.9-34.ELsmp on an x86_64
CMAN 2.6.9-43.8 (built Feb 26 2006 21:05:40) installed
DLM 2.6.9-41.7 (built Feb 26 2006 21:32:11) installed
Lock_Harness 2.6.9-49.1 (built Feb 26 2006 21:50:40) installed
GFS 2.6.9-49.1 (built Feb 26 2006 21:50:56) installed
Lock_Nolock 2.6.9-49.1 (built Feb 26 2006 21:50:40) installed
I'm not totally convinced by this fix, but neither can I find any other suitable explanation, so it's worth a try. AMD64 oopses are not that helpful either :(

RHEL4:
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/proc.c,v  <--  proc.c
new revision: 1.11.2.6; previous revision: 1.11.2.5
done

STABLE:
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/proc.c,v  <--  proc.c
new revision: 1.11.2.2.4.1.2.3; previous revision: 1.11.2.2.4.1.2.2
done
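For illustration only: the oops shows proc_cluster_status() dereferencing a NULL pointer while walking cluster state during a read of a /proc file, just as a node was rejoining. The sketch below is NOT the actual cman-kernel code or the committed fix; struct cluster_node here is a hypothetical stand-in, and the guard shown is merely the general defensive pattern such a fix would take: check every list pointer and skip half-initialised entries before formatting them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a cluster membership entry; the real
 * structure in cman-kernel is considerably more complex. */
struct cluster_node {
    struct cluster_node *next;
    char name[64];
    int votes;
};

/* Sketch of a /proc-style status formatter.  A node rejoining the
 * cluster can briefly leave a list entry NULL or half-filled, so the
 * walk checks each pointer before dereferencing it instead of
 * trusting the list to be fully populated. */
static int format_cluster_status(const struct cluster_node *head,
                                 char *buf, size_t len)
{
    size_t off = 0;
    for (const struct cluster_node *n = head; n != NULL; n = n->next) {
        if (n->name[0] == '\0')   /* skip half-initialised entries */
            continue;
        int w = snprintf(buf + off, len - off, "%s %d\n",
                         n->name, n->votes);
        if (w < 0 || (size_t)w >= len - off)
            return -1;            /* output buffer exhausted */
        off += (size_t)w;
    }
    return (int)off;              /* bytes written */
}
```

The point of the pattern is that membership state visible through /proc is mutated concurrently by join/leave events, so a read handler must tolerate entries that are missing or not yet initialised rather than assume a consistent snapshot.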
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0559.html