Bug 189605 - NULL pointer dereference during a cman join
Summary: NULL pointer dereference during a cman join
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-04-21 15:39 UTC by Corey Marthaler
Modified: 2009-04-16 20:00 UTC

Fixed In Version: RHBA-2006-0559
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-10 21:32:48 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2006:0559 0 normal SHIPPED_LIVE cman-kernel bug fix update 2006-08-10 04:00:00 UTC

Description Corey Marthaler 2006-04-21 15:39:49 UTC
Description of problem:
Seven-node cluster (link-01,2,3,4,5,7,8). Two nodes were shot (link-05,8), and
when they came back up and attempted a cman join, link-08 panicked:


Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffffa0216b14>{:cman:proc_cluster_status+508}
PML4 1e43f067 PGD 1ea62067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U)
md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas
mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 6102, comm: cat Tainted: G   M  2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa0216b14>] <ffffffffa0216b14>{:cman:proc_cluster_status+508}
RSP: 0018:000001001d59be78  EFLAGS: 00010202
RAX: 000000000000010f RBX: ffffffffa022c2e0 RCX: ffffffffa021eb9e
RDX: 000001003c01b400 RSI: 0000000000000004 RDI: 203a736573736572
RBP: 000000000000011f R08: 6464612065646f4e R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 000001001cb61000 R14: 0000000000000005 R15: 0000000000000007
FS:  0000002a95574b00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process cat (pid: 6102, threadinfo 000001001d59a000, task 000001003bd917f0)
Stack: 0000000000000004 0000000000000000 00000100397c78c0 000001001cb61000
       0000000000000400 0000000000000400 000001001d59bf50 ffffffff801ab147
       000001003cd585c0 0000000000000000
Call Trace:<ffffffff801ab147>{proc_file_read+201} <ffffffff80177a83>{vfs_read+207}
       <ffffffff80177cda>{sys_read+69} <ffffffff801101c6>{system_call+126}


Code: 49 8b 04 24 0f 18 08 48 83 c2 18 49 39 d4 0f 84 1a 01 00 00
RIP <ffffffffa0216b14>{:cman:proc_cluster_status+508} RSP <000001001d59be78>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Oops
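
A side note on reading the register dump above: R08 and RDI hold values that look like printable ASCII rather than pointers, and R12 is zero, which matches the faulting address in CR2. The sketch below is only a reading aid, not part of the original report. It unpacks a 64-bit register value as eight little-endian bytes; the helper name reg_to_ascii is made up for illustration.

```python
import struct


def reg_to_ascii(value: int) -> str:
    """Interpret a 64-bit register value as 8 little-endian bytes of text.

    Non-printable bytes are shown as '.' so that pointer-like values
    remain recognisable as "not text".
    """
    raw = struct.pack("<Q", value)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Register values copied from the oops above.
registers = {
    "R08": 0x6464612065646F4E,
    "RDI": 0x203A736573736572,
}

for name, value in registers.items():
    print(f"{name}: {reg_to_ascii(value)!r}")
```

The two values decode to "Node add" and "resses: ", which suggests (but does not prove) that proc_cluster_status had just finished writing a "Node addresses: " label and was about to walk a list of addresses when it dereferenced the NULL pointer.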


From another node still up in the cluster:
[root@link-01 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    7   M   link-01
   2    1    7   M   link-03
   3    1    7   M   link-07
   4    1    7   X   link-08
   5    1    7   M   link-04
   6    1    7   X   link-05
   7    1    7   M   link-02
[root@link-01 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 5 7 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 1 5 7 3]

DLM Lock Space:  "link0"                             5   4 run       -
[2 1 5 7 3]

DLM Lock Space:  "link1"                             7   6 run       -
[2 1 5 7 3]

GFS Mount Group: "link0"                             6   5 run       -
[2 1 5 7 3]

GFS Mount Group: "link1"                             8   7 run       -
[2 1 5 7 3]
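
For anyone scripting around this, the /proc/cluster/nodes listing above can be checked mechanically. The following is a minimal sketch, assuming the five-column layout shown in this report and assuming that "M" in the Sts column marks a current member; the function names are hypothetical and the sample text is pasted from the output above.

```python
SAMPLE = """\
Node  Votes Exp Sts  Name
   1    1    7   M   link-01
   2    1    7   M   link-03
   3    1    7   M   link-07
   4    1    7   X   link-08
   5    1    7   M   link-04
   6    1    7   X   link-05
   7    1    7   M   link-02
"""


def parse_cluster_nodes(text):
    """Parse /proc/cluster/nodes-style text into a list of dicts.

    Lines that do not have five fields starting with a numeric node ID
    (such as the header) are skipped.
    """
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        node_id, votes, expected, status, name = fields
        nodes.append({
            "id": int(node_id),
            "votes": int(votes),
            "expected": int(expected),
            "status": status,
            "name": name,
        })
    return nodes


def non_members(nodes):
    """Return the names of nodes whose status is anything other than 'M'."""
    return [n["name"] for n in nodes if n["status"] != "M"]


if __name__ == "__main__":
    print(non_members(parse_cluster_nodes(SAMPLE)))
```

On the listing in this report, that picks out link-08 and link-05, the two nodes that were shot.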



Version-Release number of selected component (if applicable):
Kernel 2.6.9-34.ELsmp on an x86_64
CMAN 2.6.9-43.8 (built Feb 26 2006 21:05:40) installed
DLM 2.6.9-41.7 (built Feb 26 2006 21:32:11) installed
Lock_Harness 2.6.9-49.1 (built Feb 26 2006 21:50:40) installed
GFS 2.6.9-49.1 (built Feb 26 2006 21:50:56) installed
Lock_Nolock 2.6.9-49.1 (built Feb 26 2006 21:50:40) installed

Comment 1 Christine Caulfield 2006-04-24 12:27:47 UTC
I'm not totally convinced by this fix, but neither can I find any other suitable
explanation, so it's worth a try. AMD64 oopses are not that helpful either :(

RHEL4:
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/proc.c,v  <--  proc.c
new revision: 1.11.2.6; previous revision: 1.11.2.5
done

STABLE:
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/proc.c,v  <--  proc.c
new revision: 1.11.2.2.4.1.2.3; previous revision: 1.11.2.2.4.1.2.2
done


Comment 4 Red Hat Bugzilla 2006-08-10 21:32:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0559.html


