Bug 353311 - lock_dlm oops in process_finish()
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: GFS-kernel
Version: 4
Hardware: All Linux
Priority: low
Severity: low
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2007-10-25 17:34 EDT by David Teigland
Modified: 2010-01-11 22:19 EST
Doc Type: Bug Fix
Last Closed: 2009-03-12 15:56:00 EDT


Attachments
patch to try (465 bytes, text/plain)
2007-10-30 14:10 EDT, David Teigland
Description David Teigland 2007-10-25 17:34:00 EDT
Description of problem:

Hit when running the test in bug 299061 comment 55.

Unable to handle kernel paging request at 0000000000100100 RIP:
<ffffffffa0230b3d>{:lock_dlm:process_finish+87}
PML4 129610067 PGD 76495067 PMD 0
Oops: 0000 [1] SMP
CPU 0   
Modules linked in: gfs(U) lock_dlm(U) dlm(U) cman(U) parport_pc lp parport autofs4 i2c_dev i2c_core lock_harness(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core button battery ac joydev ohci_hcd ehci_hcd k8_edac edac_mc forcedeth dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2400 sata_nv libata qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 20986, comm: lock_dlm2 Tainted: GF     2.6.9-55.ELsmp
RIP: 0010:[<ffffffffa0230b3d>] <ffffffffa0230b3d>{:lock_dlm:process_finish+87}
RSP: 0018:00000100766f7e38  EFLAGS: 00010246
RAX: 0000010001192308 RBX: 0000010128f8c600 RCX: 0000010080000000
RDX: 0000000000000202 RSI: 0000000000100100 RDI: 000001007e3fe280
RBP: 0000010001192200 R08: ffffffffa0205b58 R09: 0000000000000000
R10: 0000000100000000 R11: 00000101278d2d00 R12: 0000000000100100
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a95562de0(0000) GS:ffffffff804ed700(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000100100 CR3: 0000000000101000 CR4: 00000000000006e0
Process lock_dlm2 (pid: 20986, threadinfo 00000100766f6000, task 00000101278507f0)
Stack: 0000000100ecdebe 00000101278c6dc0 0000010001192200 0000000000000001
       0000000000000000 ffffffffa0235bf8 0000000176edc7f0 0000000000000001


Disassembling the module, the oops appears to occur while dereferencing
dlm->mg_nodes.  I see one place where the mg_nodes list is modified
without holding mg_nodes_lock: release_mg_nodes().
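
The faulting address in the oops, 0000000000100100 (it also appears in RSI, R12 and CR2), matches LIST_POISON1 (0x00100100), the value that the 2.6-era list_del() stores in a removed entry's next pointer. That is consistent with a list walk stepping onto an entry that another path had just unlinked. The following is a minimal userspace sketch of that effect, not the lock_dlm code: the list helpers are simplified re-implementations, and stale_next_after_unlocked_del() is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Poison values as defined in the 2.6-era include/linux/list.h. */
#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the entry and poison its pointers, as the kernel's list_del() does. */
static void list_del(struct list_head *e)
{
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = LIST_POISON1;
    e->prev = LIST_POISON2;
}

/*
 * Simulate the interleaving in a single thread: a "reader" has stopped on
 * node a while a "writer" deletes a without coordinating with the reader.
 * Returns the address the reader would dereference on its next step.
 */
static uintptr_t stale_next_after_unlocked_del(void)
{
    struct list_head head, a, b;
    struct list_head *reader_pos;

    list_init(&head);
    list_add_tail(&a, &head);
    list_add_tail(&b, &head);

    reader_pos = head.next;          /* reader is positioned on node a */
    assert(reader_pos == &a);

    list_del(&a);                    /* writer removes a behind its back */

    assert(head.next == &b);         /* the list itself is still consistent */
    return (uintptr_t)reader_pos->next;  /* reader's next hop is the poison */
}
```

Calling stale_next_after_unlocked_del() returns 0x100100, the same value the kernel reported as the bad paging address. In the kernel that address is unmapped, so the reader's next dereference faults instead of silently walking freed memory.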


Comment 1 David Teigland 2007-10-25 17:39:48 EDT
a possible fix to try:

RCS file: /cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/group.c,v
retrieving revision 1.8.2.2
diff -u -r1.8.2.2 group.c
--- group.c     29 Jun 2005 07:28:21 -0000      1.8.2.2
+++ group.c     25 Oct 2007 21:39:23 -0000
@@ -392,11 +392,13 @@
 {
        dlm_node_t *node, *safe;
 
+       down(&dlm->mg_nodes_lock);
        list_for_each_entry_safe(node, safe, &dlm->mg_nodes, list) {
                list_del(&node->list);
                lm_dlm_release_withdraw(dlm, node);
                kfree(node);
        }
+       up(&dlm->mg_nodes_lock);
 }
 
Comment 2 David Teigland 2007-10-25 17:54:14 EDT
patch in comment 1 doesn't work, maybe this one...

RCS file: /cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/group.c,v
retrieving revision 1.8.2.2
diff -u -r1.8.2.2 group.c
--- group.c     29 Jun 2005 07:28:21 -0000      1.8.2.2
+++ group.c     25 Oct 2007 21:53:37 -0000
@@ -751,7 +751,9 @@
        kcl_unregister_service(dlm->mg_local_id);
 
        release_jid(dlm);
+       down(&dlm->mg_nodes_lock);
        release_mg_nodes(dlm);
+       up(&dlm->mg_nodes_lock);
 }
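
The patch above takes the lock in the caller rather than inside release_mg_nodes(). A minimal userspace sketch of that calling convention follows, assuming a single-threaded stand-in for the semaphore; struct fake_sem, struct mountgroup, release_nodes_locked(), add_node() and teardown() are hypothetical names for illustration and are not taken from group.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical single-threaded stand-in for a kernel semaphore. */
struct fake_sem {
    int held;
};

static void down(struct fake_sem *s) { assert(!s->held); s->held = 1; }
static void up(struct fake_sem *s)   { assert(s->held);  s->held = 0; }

struct node {
    struct node *next;
};

struct mountgroup {
    struct fake_sem nodes_lock;
    struct node *nodes;
};

/* Push a new node onto the list; returns 0 on success, -1 on allocation failure. */
static int add_node(struct mountgroup *mg)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return -1;
    n->next = mg->nodes;
    mg->nodes = n;
    return 0;
}

/* Caller must hold mg->nodes_lock. Returns the number of nodes freed. */
static int release_nodes_locked(struct mountgroup *mg)
{
    int freed = 0;

    assert(mg->nodes_lock.held);     /* enforce the calling convention */
    while (mg->nodes) {
        struct node *n = mg->nodes;

        mg->nodes = n->next;
        free(n);
        freed++;
    }
    return freed;
}

/* Teardown path with the same shape as the patch: lock, release, unlock. */
static int teardown(struct mountgroup *mg)
{
    int freed;

    down(&mg->nodes_lock);
    freed = release_nodes_locked(mg);
    up(&mg->nodes_lock);
    return freed;
}
```

The assert in release_nodes_locked() documents the contract and makes it checkable: the helper never takes the lock itself, so every call site is responsible for holding it across the whole drain.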
 
Comment 3 David Teigland 2007-10-30 14:10:06 EDT
Created attachment 243731
patch to try

same patch as in comment 2
Comment 4 David Teigland 2008-01-14 10:30:12 EST
patch checked into RHEL4 branch

Checking in group.c;
/cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/group.c,v  <--  group.c
new revision: 1.8.2.3; previous revision: 1.8.2.2
done
Comment 5 Steve Whitehouse 2009-01-20 10:23:06 EST
Adding missing flags.
Comment 6 Chris Feist 2009-03-12 15:56:00 EDT
Already fixed in 4.7.
