Description of problem:
Seven-node cluster (link-01,2,3,4,5,7,8). Two nodes were shot (link-05 and link-08); when they came back up and attempted to cman join, link-08 panicked:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffffa0216b14>{:cman:proc_cluster_status+508}
PML4 1e43f067 PGD 1ea62067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 6102, comm: cat Tainted: G M 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa0216b14>] <ffffffffa0216b14>{:cman:proc_cluster_status+508}
RSP: 0018:000001001d59be78 EFLAGS: 00010202
RAX: 000000000000010f RBX: ffffffffa022c2e0 RCX: ffffffffa021eb9e
RDX: 000001003c01b400 RSI: 0000000000000004 RDI: 203a736573736572
RBP: 000000000000011f R08: 6464612065646f4e R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 000001001cb61000 R14: 0000000000000005 R15: 0000000000000007
FS: 0000002a95574b00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process cat (pid: 6102, threadinfo 000001001d59a000, task 000001003bd917f0)
Stack: 0000000000000004 0000000000000000 00000100397c78c0 000001001cb61000
       0000000000000400 0000000000000400 000001001d59bf50 ffffffff801ab147
       000001003cd585c0 0000000000000000
Call Trace: <ffffffff801ab147>{proc_file_read+201} <ffffffff80177a83>{vfs_read+207}
            <ffffffff80177cda>{sys_read+69} <ffffffff801101c6>{system_call+126}
Code: 49 8b 04 24 0f 18 08 48 83 c2 18 49 39 d4 0f 84 1a 01 00 00
RIP <ffffffffa0216b14>{:cman:proc_cluster_status+508} RSP <000001001d59be78>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Oops

From another node still up in the cluster:

[root@link-01 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    7   M   link-01
   2    1    7   M   link-03
   3    1    7   M   link-07
   4    1    7   X   link-08
   5    1    7   M   link-04
   6    1    7   X   link-05
   7    1    7   M   link-02

[root@link-01 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 5 7 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[2 1 5 7 3]

DLM Lock Space:  "link0"                             5   4 run       -
[2 1 5 7 3]

DLM Lock Space:  "link1"                             7   6 run       -
[2 1 5 7 3]

GFS Mount Group: "link0"                             6   5 run       -
[2 1 5 7 3]

GFS Mount Group: "link1"                             8   7 run       -
[2 1 5 7 3]

Version-Release number of selected component (if applicable):
Kernel 2.6.9-34.ELsmp on an x86_64
CMAN 2.6.9-43.8 (built Feb 26 2006 21:05:40) installed
DLM 2.6.9-41.7 (built Feb 26 2006 21:32:11) installed
Lock_Harness 2.6.9-49.1 (built Feb 26 2006 21:50:40) installed
GFS 2.6.9-49.1 (built Feb 26 2006 21:50:56) installed
Lock_Nolock 2.6.9-49.1 (built Feb 26 2006 21:50:40) installed
I'm not totally convinced by this fix, but neither can I find any other suitable explanation, so it's worth a try. AMD64 oopses are not that helpful either :(

RHEL4:
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/proc.c,v  <--  proc.c
new revision: 1.11.2.6; previous revision: 1.11.2.5
done

STABLE:
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/proc.c,v  <--  proc.c
new revision: 1.11.2.2.4.1.2.3; previous revision: 1.11.2.2.4.1.2.2
done
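For illustration only: the oops shows proc_cluster_status() dereferencing a NULL pointer while walking cluster state during a read of a /proc file, just as a node was rejoining. The sketch below is NOT the actual cman-kernel code or the committed fix; struct cluster_node here is a hypothetical stand-in, and the guard shown is merely the general defensive pattern such a fix would take: check every list pointer and skip half-initialised entries before formatting them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a cluster membership entry; the real
 * structure in cman-kernel is considerably more complex. */
struct cluster_node {
    struct cluster_node *next;
    char name[64];
    int votes;
};

/* Sketch of a /proc-style status formatter.  A node rejoining the
 * cluster can briefly leave a list entry NULL or half-filled, so the
 * walk checks each pointer before dereferencing it instead of
 * trusting the list to be fully populated. */
static int format_cluster_status(const struct cluster_node *head,
                                 char *buf, size_t len)
{
    size_t off = 0;
    for (const struct cluster_node *n = head; n != NULL; n = n->next) {
        if (n->name[0] == '\0')   /* skip half-initialised entries */
            continue;
        int w = snprintf(buf + off, len - off, "%s %d\n",
                         n->name, n->votes);
        if (w < 0 || (size_t)w >= len - off)
            return -1;            /* output buffer exhausted */
        off += (size_t)w;
    }
    return (int)off;              /* bytes written */
}
```

The point of the pattern is that membership state visible through /proc is mutated concurrently by join/leave events, so a read handler must tolerate entries that are missing or not yet initialised rather than assume a consistent snapshot.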
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0559.html