Description of problem:
I got the following crash while running heavy IMAP (Courier) load. The
load ran for about 4 hours.
- 2 nodes (SMP)
- 3TB GFS partition
- lock_dlm in use
- IMAP load only on one node (only one node mounting the filesystem)
Unable to handle kernel paging request at virtual address 00100104
*pde = 00003001
Oops: 0002 [#1]
Modules linked in: ip_vs_wlc ip_vs rfcomm l2cap bluetooth lock_dlm(U)
dlm(U) cman(U) gfs(U) lock_harness(U) dm_mod md5 ipv6 autofs4 e1000
bonding uhci_hcd button battery asus_acpi ac ext3 jbd raid1 aic79xx
EIP: 0060:[<82b7e4d3>] Not tainted
EFLAGS: 00010246 (2.6.8-1.521.rootsmp)
EIP is at recent_rgrp_remove+0x47/0x94 [gfs]
eax: 00100100 ebx: 7d0cec00 ecx: 7d0cec10 edx: 00200200
esi: 82c63000 edi: 1127952c ebp: 11279400 esp: 18f1ce34
ds: 007b es: 007b ss: 0068
Process imapd (pid: 24303, threadinfo=18f1c000 task=51ad8770)
Stack: 00000003 00000003 00000000 00000000 82b7e82a 00000000 00000000
00000001 82c63000 34748344 00000000 11279400 34748344 112794f8
000002d7 82b86514 00000000 00000000 34748344 00001000 82b74a54
[<82b7e82a>] get_local_rgrp+0xa1/0x1e9 [gfs]
[<82b7e9db>] gfs_inplace_reserve_i+0x69/0xa1 [gfs]
[<82b74a54>] do_do_write_buf+0xf2/0x3b0 [gfs]
[<82b74e10>] do_write_buf+0xfe/0x140 [gfs]
[<82b7406b>] walk_vm+0xd6/0xfa [gfs]
[<82b74ef1>] gfs_write+0x9f/0xb8 [gfs]
[<82b74d12>] do_write_buf+0x0/0x140 [gfs]
Code: 89 50 04 89 02 c7 41 04 00 02 20 00 8b 93 24 01 00 00 b1 01
Version-Release number of selected component (if applicable):
CVS head 2004-10-14, with bug #135249 fixed.
At the moment I don't have simple instructions for reproducing the crash.
I have hit this error a few times, but it takes a long time to reproduce.
One question that may be related: is there a reason why
sd_rg_recent_lock is not held while the sd_rg_recent list is being
cleared in clear_rgrpdi()?
Created attachment 105935
The bug is really in get_local_rgrp(): the recent_rgrp list is not correctly
locked there.
recent_rgrp_first() returns an rgd, and although the list is not locked, the
later call to recent_rgrp_remove() still assumes that the entry has not already
been removed from the list.
The attached patch simply iterates the list (inside sd_rg_recent_lock) in
recent_rgrp_remove() and verifies that the entry is in fact still a member of
the list.
However, this is probably not the optimal way to fix the problem: both
recent_rgrp_next() and recent_rgrp_remove() now iterate the same list for
apparently the same reason. Would correct locking be preferable?
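
For illustration, here is a minimal user-space sketch of the "check membership
before removing" idea. It is not the attached patch and not GFS code: the names
list_node, list_unlink(), remove_if_member() and run_demo() are hypothetical,
and the lock is only marked by comments. The poison constants mirror the values
the 2.6 kernel's list_del() writes into a deleted entry (0x00100100 and
0x00200200), which match eax and edx in the oops above and are consistent with
a second removal of the same entry.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_node {
    struct list_node *next, *prev;
};

/* Same values as the 2.6 kernel's LIST_POISON1 / LIST_POISON2. */
#define POISON_NEXT ((struct list_node *)0x00100100)
#define POISON_PREV ((struct list_node *)0x00200200)

void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

void list_add_front(struct list_node *head, struct list_node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink n and poison its pointers. Calling this twice on the same
 * entry would write through POISON_NEXT, like the faulting store at
 * 0x00100104 in the oops. */
void list_unlink(struct list_node *n)
{
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = POISON_NEXT;
    n->prev = POISON_PREV;
}

/* Hypothetical guarded removal: walk the list and unlink target only
 * if it is still a member. Returns 1 if removed, 0 otherwise.
 * In the kernel, the whole walk would run with the list lock held. */
int remove_if_member(struct list_node *head, struct list_node *target)
{
    struct list_node *pos;
    int removed = 0;

    /* lock would be taken here */
    for (pos = head->next; pos != head; pos = pos->next) {
        if (pos == target) {
            list_unlink(pos);
            removed = 1;
            break;
        }
    }
    /* lock would be released here */
    return removed;
}

/* Two removals of the same entry: the second one is rejected. */
int run_demo(void)
{
    struct list_node recent, rgd_a, rgd_b;

    list_init(&recent);
    list_add_front(&recent, &rgd_a);
    list_add_front(&recent, &rgd_b);

    /* First caller removes rgd_a. */
    assert(remove_if_member(&recent, &rgd_a) == 1);
    /* A caller holding a stale pointer is turned away. */
    assert(remove_if_member(&recent, &rgd_a) == 0);
    /* rgd_b is untouched and is now the only entry. */
    assert(recent.next == &rgd_b && recent.prev == &rgd_b);
    return 0;
}
```

The walk costs O(n) per removal, which is the overhead questioned above.
Holding the lock across both the lookup and the removal would avoid the stale
pointer without a second traversal.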
The other locking change in the patch (in gfs_ri_update() and clear_rgrpdi()) is
not directly related, but it seems more correct to me that way. You may
have a better understanding of the real situation.
A fix should be in the CVS head now.
I haven't heard any objections, so I'm marking this as fixed.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.