Bug 206212

Summary: kernel oops in cman:process_startdone_barrier_new while attempting many simultaneous mounts
Product: [Retired] Red Hat Cluster Suite Reporter: Corey Marthaler <cmarthal>
Component: cman Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 4 CC: cluster-maint, teigland
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0135 Doc Type: Bug Fix
Last Closed: 2007-05-10 21:22:04 UTC Type: ---

Description Corey Marthaler 2006-09-12 22:40:48 UTC
Description of problem:
After a fresh reboot of all 4 nodes in the taft cluster, I brought them back up
and attempted to mount all 60 GFS filesystems on all four nodes simultaneously.
This caused taft-01 to panic.

Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa021c43a>{:cman:process_startdone_barrier_new+7}
PML4 20c02e067 PGD 20db06067 PMD 0
Oops: 0002 [1] SMP
CPU 2
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6
parport_pc lp parpord
Pid: 4706, comm: cman_serviced Not tainted 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa021c43a>]
<ffffffffa021c43a>{:cman:process_startdone_barrier_new+7}
RSP: 0018:0000010212ea3f00  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000010217b00480 RCX: 00000100dfde5800
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000100dfde5800
RBP: ffffffffa021c831 R08: 0000010212ea2000 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000000a R12: 00000102128a97c8
R13: 00000000fffffffc R14: 00000102128a97b8 R15: ffffffff8014b4f0
FS:  0000000000000000(0000) GS:ffffffff804e5280(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000000dff9e000 CR4: 00000000000006e0
Process cman_serviced (pid: 4706, threadinfo 0000010212ea2000, task
0000010215cb47f0)
Stack: ffffffffa021c5aa 0000000000000000 ffffffffa021c873 0000000000000018
       ffffffff8014b4c7 ffffffffffffffff 00000102128a97b8 00000102128a9750
       00000100dff613c0 0000000000000216
Call Trace:<ffffffffa021c5aa>{:cman:process_barriers+146}
<ffffffffa021c873>{:cman:serviced+66}
       <ffffffff8014b4c7>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014b4f0>{keventd_create_kthread+0} <ffffffff8014b3ff>{kthread+0}
       <ffffffff80110f3f>{child_rip+0}

Code: f0 0f ba 72 20 06 19 c0 85 c0 75 12 48 8b 7a 18 89 f2 48 c7
RIP <ffffffffa021c43a>{:cman:process_startdone_barrier_new+7} RSP <0000010212ea3f00>
CR2: 0000000000000020
 <0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
[root@taft-01 ~]# uname -ar
Linux taft-01 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux
[root@taft-01 ~]# rpm -q cman
cman-1.0.11-0

Comment 1 David Teigland 2006-09-13 14:50:20 UTC
The immediate cause of this is pretty simple to fix: the sg passed
to process_startdone_barrier_new() has a NULL sevent in some cases
(the deeper question is why).  The function immediately references
sev->flags without checking for NULL, which leads to the oops above.
We now just print an error and return if the sevent is NULL, instead
of oopsing.
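
The guard described above can be sketched in user-space C roughly as
follows. This is not the code committed to sm_barrier.c. The struct
layouts, the function name process_startdone_barrier_sketch, and the
return values are all assumptions made for illustration. The padding is
chosen only so that flags sits at offset 0x20, which is consistent with
the faulting address in CR2 when sev is NULL.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the cman types; the real definitions differ. */
struct sevent {
    char pad[0x20];         /* assumed padding so flags lands at offset 0x20 */
    unsigned long flags;
};

struct sgroup {
    struct sevent *sevent;  /* may be NULL, which is what triggered the oops */
};

/*
 * Returns 0 if the barrier was handled, -1 if the sevent was missing.
 * The real function returns void; the int result is only here so the
 * behaviour can be checked from a test.
 */
static int process_startdone_barrier_sketch(struct sgroup *sg, int status)
{
    struct sevent *sev = sg->sevent;

    if (sev == NULL) {
        /* Report and bail out instead of dereferencing NULL. */
        fprintf(stderr, "startdone barrier: no sevent (status %d)\n", status);
        return -1;
    }

    /* Stand-in for the atomic bit clear on sev->flags; bit 6 is illustrative. */
    sev->flags &= ~(1UL << 6);
    return 0;
}
```

With this check in place, a group whose sevent is NULL produces an error
message and an early return rather than a write to address 0x20.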

cvs commit: Examining .
Checking in sm_barrier.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/sm_barrier.c,v  <--  sm_barrier.c
new revision: 1.1.2.2; previous revision: 1.1.2.1
done


Comment 2 David Teigland 2006-09-19 16:32:55 UTC
The reason process_startdone_barrier_new() is being called with
a NULL sevent is bug 206193.


Comment 3 Corey Marthaler 2007-04-23 20:48:17 UTC
Just mounted 65 GFS filesystems simultaneously on a 4-node cluster. Marking verified.

Comment 5 Red Hat Bugzilla 2007-05-10 21:22:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0135.html