Bug 206212 - kernel oops in cman:process_startdone_barrier_new during the attempt of many mounts
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Version: ---
Target Milestone: ---
Assigned To: Christine Caulfield
Cluster QE
Depends On:
Reported: 2006-09-12 18:40 EDT by Corey Marthaler
Modified: 2009-04-16 16:01 EDT (History)
2 users

See Also:
Fixed In Version: RHBA-2007-0135
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2007-05-10 17:22:04 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments

  None
Description Corey Marthaler 2006-09-12 18:40:48 EDT
Description of problem:
After a fresh reboot of all 4 nodes in the taft cluster, I brought them back up
and attempted to mount all 60 GFS on all four nodes simultaneously. This caused
taft-01 to panic.

Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
PML4 20c02e067 PGD 20db06067 PMD 0
Oops: 0002 [1] SMP
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6
parport_pc lp parport
Pid: 4706, comm: cman_serviced Not tainted 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa021c43a>]
RSP: 0018:0000010212ea3f00  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000010217b00480 RCX: 00000100dfde5800
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000100dfde5800
RBP: ffffffffa021c831 R08: 0000010212ea2000 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000000a R12: 00000102128a97c8
R13: 00000000fffffffc R14: 00000102128a97b8 R15: ffffffff8014b4f0
FS:  0000000000000000(0000) GS:ffffffff804e5280(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000000dff9e000 CR4: 00000000000006e0
Process cman_serviced (pid: 4706, threadinfo 0000010212ea2000, task
Stack: ffffffffa021c5aa 0000000000000000 ffffffffa021c873 0000000000000018
       ffffffff8014b4c7 ffffffffffffffff 00000102128a97b8 00000102128a9750
       00000100dff613c0 0000000000000216
Call Trace:<ffffffffa021c5aa>{:cman:process_barriers+146}
       <ffffffff8014b4c7>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014b4f0>{keventd_create_kthread+0} <ffffffff8014b3ff>{kthread+0}

Code: f0 0f ba 72 20 06 19 c0 85 c0 75 12 48 8b 7a 18 89 f2 48 c7
RIP <ffffffffa021c43a>{:cman:process_startdone_barrier_new+7} RSP <0000010212ea3f00>
CR2: 0000000000000020
 <0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
[root@taft-01 ~]# uname -ar
Linux taft-01 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux
[root@taft-01 ~]# rpm -q cman
Comment 1 David Teigland 2006-09-13 10:50:20 EDT
The immediate cause of this is pretty simple to fix: the sg passed
to process_startdone_barrier_new() has a NULL sevent in some cases
(the deeper question is why).  The function immediately references
sev->flags without checking for NULL, which leads to the oops above.
We now just print an error and return if the sevent is NULL instead
of oopsing.

cvs commit: Examining .
Checking in sm_barrier.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/sm_barrier.c,v  <--  sm_barrier.c
new revision:; previous revision:
Comment 2 David Teigland 2006-09-19 12:32:55 EDT
The reason process_startdone_barrier_new() is being called with
a NULL sevent is bug 206193.
Comment 3 Corey Marthaler 2007-04-23 16:48:17 EDT
Just mounted 65 GFS filesystems simultaneously on a 4-node cluster. Marking verified.
Comment 5 Red Hat Bugzilla 2007-05-10 17:22:04 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
