Description of problem:
After a fresh reboot of all 4 nodes in the taft cluster, I brought them back up and attempted to mount all 60 GFS on all four nodes simultaneously. This caused taft-01 to panic.

Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa021c43a>{:cman:process_startdone_barrier_new+7}
PML4 20c02e067 PGD 20db06067 PMD 0
Oops: 0002 [1] SMP
CPU 2
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parpord
Pid: 4706, comm: cman_serviced Not tainted 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa021c43a>] <ffffffffa021c43a>{:cman:process_startdone_barrier_new+7}
RSP: 0018:0000010212ea3f00 EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000010217b00480 RCX: 00000100dfde5800
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000100dfde5800
RBP: ffffffffa021c831 R08: 0000010212ea2000 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000000a R12: 00000102128a97c8
R13: 00000000fffffffc R14: 00000102128a97b8 R15: ffffffff8014b4f0
FS: 0000000000000000(0000) GS:ffffffff804e5280(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000000dff9e000 CR4: 00000000000006e0
Process cman_serviced (pid: 4706, threadinfo 0000010212ea2000, task 0000010215cb47f0)
Stack: ffffffffa021c5aa 0000000000000000 ffffffffa021c873 0000000000000018
       ffffffff8014b4c7 ffffffffffffffff 00000102128a97b8 00000102128a9750
       00000100dff613c0 0000000000000216
Call Trace:<ffffffffa021c5aa>{:cman:process_barriers+146} <ffffffffa021c873>{:cman:serviced+66}
       <ffffffff8014b4c7>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014b4f0>{keventd_create_kthread+0} <ffffffff8014b3ff>{kthread+0}
       <ffffffff80110f3f>{child_rip+0}

Code: f0 0f ba 72 20 06 19 c0 85 c0 75 12 48 8b 7a 18 89 f2 48 c7
RIP <ffffffffa021c43a>{:cman:process_startdone_barrier_new+7} RSP <0000010212ea3f00>
CR2: 0000000000000020
 <0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
[root@taft-01 ~]# uname -ar
Linux taft-01 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@taft-01 ~]# rpm -q cman
cman-1.0.11-0
The immediate cause of this is simple to fix: the sg passed to process_startdone_barrier_new() has a NULL sevent in some cases (the deeper question is why). The function immediately dereferences sev->flags without checking for NULL, which leads to the oops above. We now just print an error and return if the sevent is NULL instead of oopsing.

cvs commit: Examining .
Checking in sm_barrier.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/sm_barrier.c,v <-- sm_barrier.c
new revision: 1.1.2.2; previous revision: 1.1.2.1
done
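
For illustration, the NULL guard described above might look roughly like the following. This is a user-space sketch, not the code committed to sm_barrier.c: the struct layouts, the flag value, the error message, and the int return value (added only so the behaviour can be checked) are all assumptions, and fprintf stands in for the kernel's logging call.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-ins for the cman service-manager structures. */
struct sm_sevent {
    unsigned long flags;
};

struct sm_group {
    const char *name;
    struct sm_sevent *sevent;   /* can be NULL in some cases, per the comment above */
};

/* Hypothetical flag bit; the real flag names and values are not shown in this report. */
#define SEFL_EXAMPLE_BARRIER 0x40UL

/*
 * Sketch of the guarded function. Returns 0 when the sevent was
 * processed and -1 when it was missing. The return value is an
 * assumption made for this sketch.
 */
static int process_startdone_barrier_new(struct sm_group *sg, int status)
{
    struct sm_sevent *sev = sg->sevent;

    if (sev == NULL) {
        /* Previously the function touched sev->flags here and oopsed. */
        fprintf(stderr,
                "process_startdone_barrier_new: no sevent for group %s (status %d)\n",
                sg->name, status);
        return -1;
    }

    /* Only touch the flags once we know the pointer is valid. */
    sev->flags &= ~SEFL_EXAMPLE_BARRIER;
    return 0;
}
```

Calling this with a group whose sevent is NULL logs the message and returns instead of faulting; a group with a valid sevent has the example flag cleared as before.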
The reason process_startdone_barrier_new() is being called with a NULL sevent is bug 206193.
Just mounted 65 GFS filesystems simultaneously on a 4-node cluster. Marking verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0135.html