Bug 135684 - GFS: Unable to handle kernel paging request (get_local_rgrp)
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 3
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Ken Preslan
QA Contact: GFS Bugs
Reported: 2004-10-14 09:27 EDT by Alexander Laamanen
Modified: 2010-01-11 21:59 EST (History)
Doc Type: Bug Fix
Last Closed: 2004-11-16 10:15:23 EST
Attachments
demonstration patch (1.65 KB, patch)
2004-10-29 08:15 EDT, Alexander Laamanen
Description Alexander Laamanen 2004-10-14 09:27:50 EDT
Description of problem:

I got the following crash while running a heavy IMAP (Courier) load. The
load had been running for about 4 hours.

The system:
- 2 nodes (SMP)
- 3TB GFS partition
- lock_dlm in use
- IMAP load only on one node (only one node mounting the filesystem)

Unable to handle kernel paging request at virtual address 00100104
 printing eip:
82b7e4d3
*pde = 00003001
Oops: 0002 [#1]
SMP
Modules linked in: ip_vs_wlc ip_vs rfcomm l2cap bluetooth lock_dlm(U)
dlm(U) cman(U) gfs(U) lock_harness(U) dm_mod md5 ipv6 autofs4 e1000
bonding uhci_hcd button battery asus_acpi ac ext3 jbd raid1 aic79xx
sd_mod scsi_mod
CPU:    0
EIP:    0060:[<82b7e4d3>]    Not tainted
EFLAGS: 00010246   (2.6.8-1.521.rootsmp)
EIP is at recent_rgrp_remove+0x47/0x94 [gfs]
eax: 00100100   ebx: 7d0cec00   ecx: 7d0cec10   edx: 00200200
esi: 82c63000   edi: 1127952c   ebp: 11279400   esp: 18f1ce34
ds: 007b   es: 007b   ss: 0068
Process imapd (pid: 24303, threadinfo=18f1c000 task=51ad8770)
Stack: 00000003 00000003 00000000 00000000 82b7e82a 00000000 00000000 00000000
       00000001 82c63000 34748344 00000000 11279400 34748344 112794f8 82b7e9db
       000002d7 82b86514 00000000 00000000 34748344 00001000 82b74a54 18f1cf18
Call Trace:
 [<82b7e82a>] get_local_rgrp+0xa1/0x1e9 [gfs]
 [<82b7e9db>] gfs_inplace_reserve_i+0x69/0xa1 [gfs]
 [<82b74a54>] do_do_write_buf+0xf2/0x3b0 [gfs]
 [<82b74e10>] do_write_buf+0xfe/0x140 [gfs]
 [<82b7406b>] walk_vm+0xd6/0xfa [gfs]
 [<82b74ef1>] gfs_write+0x9f/0xb8 [gfs]
 [<82b74d12>] do_write_buf+0x0/0x140 [gfs]
 [<0215b2fb>] vfs_write+0xb8/0xe4
 [<0215b3c5>] sys_write+0x3c/0x62
Code: 89 50 04 89 02 c7 41 04 00 02 20 00 8b 93 24 01 00 00 b1 01
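
A note on reading the register dump: eax and edx hold 00100100 and 00200200, which match the LIST_POISON1 and LIST_POISON2 values that 2.6 kernels write into the next and prev pointers of an entry removed with list_del(). The faulting address 00100104 is consistent with a write to next->prev (offset 4 on i686) on an entry that had already been deleted once. The following is a minimal userspace sketch of that mechanism, using simplified stand-ins for the kernel list helpers; list_entry_poisoned() and list_del_checked() are hypothetical names for illustration and are not taken from the GFS or kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Poison values used by 2.6-era kernels; compare eax/edx in the oops. */
#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink entry and poison its pointers, as the kernel's list_del() does. */
static void list_del(struct list_head *entry)
{
    entry->next->prev = entry->prev; /* a second call writes to LIST_POISON1->prev */
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* True if entry looks like it was already removed with list_del(). */
static bool list_entry_poisoned(const struct list_head *entry)
{
    return entry->next == LIST_POISON1 && entry->prev == LIST_POISON2;
}

/* Hypothetical debug variant: trap a double delete before it faults. */
static void list_del_checked(struct list_head *entry)
{
    assert(!list_entry_poisoned(entry));
    list_del(entry);
}
```

Calling list_del() twice on the same entry in this sketch would dereference LIST_POISON1, which is the same kind of access the oops reports.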


Version-Release number of selected component (if applicable):
CVS head as of 2004-10-14, with bug #135249 fixed.

How reproducible:
At the moment I don't have simple instructions for reproducing the crash.
Comment 1 Alexander Laamanen 2004-10-21 08:46:06 EDT
I have hit this error a few times, but it takes a long time to
reproduce it...
One question that may be related: is there a reason why
sd_rg_recent_lock is not held while the sd_rg_recent list is being
cleared in clear_rgrpdi()?
Comment 2 Alexander Laamanen 2004-10-29 08:15:59 EDT
Created attachment 105935 [details]
demonstration patch

Hi,

The bug is really in get_local_rgrp(): the recent_rgrp list is not correctly
locked there.

recent_rgrp_first() returns an rgd. The list is not locked at that point, yet
the later call to recent_rgrp_remove() assumes that the entry has not already
been removed from the list.

The attached patch simply iterates the list (inside sd_rg_recent_lock) in
recent_rgrp_remove() and verifies that the entry is in fact still a member of
the list.
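
The membership check described above could look roughly like the sketch below. It is a userspace illustration and not the attached patch: struct sbd and struct rgrp are cut-down hypothetical versions of the GFS structures, a C11 atomic_flag spinlock stands in for sd_rg_recent_lock, and sbd_init(), recent_add() and recent_remove_checked() are made-up helper names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical, cut-down versions of the GFS structures. */
struct rgrp {
    struct list_head recent;      /* link in the recent list */
};

struct sbd {
    atomic_flag lock;             /* stand-in for sd_rg_recent_lock */
    struct list_head recent_list; /* stand-in for sd_rg_recent */
};

static void sbd_init(struct sbd *sdp)
{
    atomic_flag_clear(&sdp->lock);
    sdp->recent_list.next = &sdp->recent_list;
    sdp->recent_list.prev = &sdp->recent_list;
}

static void recent_lock(struct sbd *sdp)
{
    while (atomic_flag_test_and_set(&sdp->lock))
        ; /* spin until the flag is released */
}

static void recent_unlock(struct sbd *sdp)
{
    atomic_flag_clear(&sdp->lock);
}

/* Add rgd at the front of the recent list. */
static void recent_add(struct sbd *sdp, struct rgrp *rgd)
{
    recent_lock(sdp);
    rgd->recent.next = sdp->recent_list.next;
    rgd->recent.prev = &sdp->recent_list;
    sdp->recent_list.next->prev = &rgd->recent;
    sdp->recent_list.next = &rgd->recent;
    recent_unlock(sdp);
}

/* Remove rgd only if it is still on the list; report whether it was. */
static bool recent_remove_checked(struct sbd *sdp, struct rgrp *rgd)
{
    struct list_head *pos;
    bool found = false;

    recent_lock(sdp);
    for (pos = sdp->recent_list.next; pos != &sdp->recent_list; pos = pos->next) {
        if (pos == &rgd->recent) {
            pos->prev->next = pos->next;
            pos->next->prev = pos->prev;
            pos->next = pos->prev = pos; /* self-linked after removal */
            found = true;
            break;
        }
    }
    recent_unlock(sdp);
    return found;
}
```

A second recent_remove_checked() call for the same rgd finds nothing on the list and returns false instead of unlinking a stale entry. The cost is a walk of the list on every removal.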

However, this is probably not the optimal way to fix the problem. Both
recent_rgrp_next() and recent_rgrp_remove() now iterate the same list for
apparently the same reason. Wouldn't correct locking be preferable?
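
One possible shape for "correct locking" is to make the lookup and the unlink a single critical section, so that no other thread can remove the entry in between. The sketch below is only an illustration of that idea under assumed names; it is not taken from the GFS sources, and struct sbd, struct rgrp, recent_add() and recent_take_first() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical, cut-down versions of the GFS structures. */
struct rgrp {
    struct list_head recent;      /* link in the recent list */
};

struct sbd {
    atomic_flag lock;             /* stand-in for sd_rg_recent_lock */
    struct list_head recent_list; /* stand-in for sd_rg_recent */
};

static void sbd_init(struct sbd *sdp)
{
    atomic_flag_clear(&sdp->lock);
    sdp->recent_list.next = &sdp->recent_list;
    sdp->recent_list.prev = &sdp->recent_list;
}

/* Add rgd at the front of the recent list. */
static void recent_add(struct sbd *sdp, struct rgrp *rgd)
{
    while (atomic_flag_test_and_set(&sdp->lock))
        ; /* spin */
    rgd->recent.next = sdp->recent_list.next;
    rgd->recent.prev = &sdp->recent_list;
    sdp->recent_list.next->prev = &rgd->recent;
    sdp->recent_list.next = &rgd->recent;
    atomic_flag_clear(&sdp->lock);
}

/*
 * Look up the first entry and unlink it while holding the lock the whole
 * time. Returns NULL if the list is empty.
 */
static struct rgrp *recent_take_first(struct sbd *sdp)
{
    struct list_head *first;
    struct rgrp *rgd = NULL;

    while (atomic_flag_test_and_set(&sdp->lock))
        ; /* spin */
    first = sdp->recent_list.next;
    if (first != &sdp->recent_list) {
        first->prev->next = first->next;
        first->next->prev = first->prev;
        first->next = first->prev = first;
        /* Recover the containing rgrp from its embedded list link. */
        rgd = (struct rgrp *)((char *)first - offsetof(struct rgrp, recent));
    }
    atomic_flag_clear(&sdp->lock);
    return rgd;
}
```

With this shape each entry is handed to exactly one caller, so no second list walk is needed to confirm membership before removal.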

The other locking change in the patch (gfs_ri_update() and clear_rgrpdi()) is
not directly related, but to me it seems more correct that way. You may
have a better understanding of the real situation, though.
Comment 3 Ken Preslan 2004-11-09 04:06:18 EST
A fix should be in the CVS head now.

Comment 4 Ken Preslan 2004-11-16 10:15:23 EST
I haven't heard any objections, so I'm marking this as fixed.
Comment 5 John Flanagan 2004-12-21 10:58:29 EST
An advisory has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-602.html
