Bug 135684 - GFS: Unable to handle kernel paging request (get_local_rgrp)
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 3
Hardware: i686 Linux
Target Milestone: ---
Assignee: Ken Preslan
QA Contact: GFS Bugs
Depends On:
Reported: 2004-10-14 13:27 UTC by Alexander Laamanen
Modified: 2010-01-12 02:59 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2004-11-16 15:15:23 UTC

Attachments
demonstration patch (1.65 KB, patch)
2004-10-29 12:15 UTC, Alexander Laamanen

External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2004:602 | normal | SHIPPED_LIVE | Updated GFS packages | 2004-12-21 05:00:00 UTC

Description Alexander Laamanen 2004-10-14 13:27:50 UTC
Description of problem:

I got the following crash while running a heavy IMAP (Courier) load. The
load ran for about 4 hours.

The system:
- 2 nodes (SMP)
- 3TB GFS partition
- lock_dlm in use
- IMAP load only on one node (only one node mounting the filesystem)

Unable to handle kernel paging request at virtual address 00100104
 printing eip:
*pde = 00003001
Oops: 0002 [#1]
Modules linked in: ip_vs_wlc ip_vs rfcomm l2cap bluetooth lock_dlm(U)
dlm(U) cman(U) gfs(U) lock_harness(U) dm_mod md5 ipv6 autofs4 e1000
bonding uhci_hcd button battery asus_acpi ac ext3 jbd raid1 aic79xx
sd_mod scsi_mod
CPU:    0
EIP:    0060:[<82b7e4d3>]    Not tainted
EFLAGS: 00010246   (2.6.8-1.521.rootsmp)
EIP is at recent_rgrp_remove+0x47/0x94 [gfs]
eax: 00100100   ebx: 7d0cec00   ecx: 7d0cec10   edx: 00200200
esi: 82c63000   edi: 1127952c   ebp: 11279400   esp: 18f1ce34
ds: 007b   es: 007b   ss: 0068
Process imapd (pid: 24303, threadinfo=18f1c000 task=51ad8770)
Stack: 00000003 00000003 00000000 00000000 82b7e82a 00000000 00000000
       00000001 82c63000 34748344 00000000 11279400 34748344 112794f8
       000002d7 82b86514 00000000 00000000 34748344 00001000 82b74a54
Call Trace:
 [<82b7e82a>] get_local_rgrp+0xa1/0x1e9 [gfs]
 [<82b7e9db>] gfs_inplace_reserve_i+0x69/0xa1 [gfs]
 [<82b74a54>] do_do_write_buf+0xf2/0x3b0 [gfs]
 [<82b74e10>] do_write_buf+0xfe/0x140 [gfs]
 [<82b7406b>] walk_vm+0xd6/0xfa [gfs]
 [<82b74ef1>] gfs_write+0x9f/0xb8 [gfs]
 [<82b74d12>] do_write_buf+0x0/0x140 [gfs]
 [<0215b2fb>] vfs_write+0xb8/0xe4
 [<0215b3c5>] sys_write+0x3c/0x62
Code: 89 50 04 89 02 c7 41 04 00 02 20 00 8b 93 24 01 00 00 b1 01

Version-Release number of selected component (if applicable):
cvs head 2004-10-14, with bug #135249 fixed.

How reproducible:
At the moment I don't have simple instructions for reproducing the crash.

Comment 1 Alexander Laamanen 2004-10-21 12:46:06 UTC
I have hit this error a few times, but it takes a long time to
reproduce.
One question that may be related: is there a reason why
sd_rg_recent_lock is not held while the sd_rg_recent list is being
cleared in clear_rgrpdi()?

Comment 2 Alexander Laamanen 2004-10-29 12:15:59 UTC
Created attachment 105935
demonstration patch


The bug is really in get_local_rgrp(): the recent_rgrp list is not correctly
locked there.

recent_rgrp_first() returns an rgd, and although the list is not locked in the
meantime, the later call to recent_rgrp_remove() still assumes that the entry
has not already been removed from the list.
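
The register dump in the description is consistent with a second removal of an
entry that was already unlinked: 00100100 and 00200200 are the LIST_POISON1 and
LIST_POISON2 values that list_del() stores in a removed entry in 2.6 kernels,
and 00100104 is LIST_POISON1 plus the 4-byte offset of prev on i686. The
following is only a user-space sketch of that mechanism, not GFS code;
second_del_write_addr() is a hypothetical helper added for illustration, and it
computes the address without dereferencing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space copy of the kernel's doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values written by list_del() in 2.6-era kernels. */
#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink the entry and poison its pointers, as the kernel does. */
void list_del(struct list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/*
 * Hypothetical helper: the address that a second list_del() would write
 * first, i.e. &entry->next->prev. Only valid for an entry that has
 * already been deleted. The pointer is never dereferenced here.
 */
uintptr_t second_del_write_addr(const struct list_head *entry)
{
    assert(entry->next == LIST_POISON1);
    return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}
```

On a 32-bit build the helper returns 0x00100104, the faulting address in the
oops; on a 64-bit build the offset of prev is 8, so the value differs.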

The attached patch simply iterates the list in recent_rgrp_remove() while
holding sd_rg_recent_lock and verifies that the entry is in fact still a
member of the list.
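
The idea of checking membership under the lock before unlinking can be sketched
in user space as follows. This is not the attached patch: struct rgrp,
struct recent_list, and recent_remove_checked() are hypothetical names, and a
C11 atomic_flag spinlock stands in for sd_rg_recent_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a resource group on the "recent" list. */
struct rgrp {
    struct rgrp *next, *prev;
};

struct recent_list {
    atomic_flag lock;   /* plays the role of sd_rg_recent_lock */
    struct rgrp head;   /* sentinel node; the list is circular */
};

static void lock_acquire(atomic_flag *f)
{
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void lock_release(atomic_flag *f)
{
    atomic_flag_clear_explicit(f, memory_order_release);
}

void recent_init(struct recent_list *l)
{
    atomic_flag_clear(&l->lock);
    l->head.next = &l->head;
    l->head.prev = &l->head;
}

void recent_add(struct recent_list *l, struct rgrp *r)
{
    /* The entry must not already be linked anywhere. */
    assert(r->next == NULL && r->prev == NULL);
    lock_acquire(&l->lock);
    r->next = l->head.next;
    r->prev = &l->head;
    l->head.next->prev = r;
    l->head.next = r;
    lock_release(&l->lock);
}

/*
 * Remove r only if it is still on the list. Returns 1 if the entry was
 * unlinked, 0 if another caller had already removed it.
 */
int recent_remove_checked(struct recent_list *l, struct rgrp *r)
{
    struct rgrp *p;
    int removed = 0;

    lock_acquire(&l->lock);
    for (p = l->head.next; p != &l->head; p = p->next) {
        if (p == r) {
            r->prev->next = r->next;
            r->next->prev = r->prev;
            r->next = NULL;
            r->prev = NULL;
            removed = 1;
            break;
        }
    }
    lock_release(&l->lock);
    return removed;
}
```

The membership walk makes removal linear in the list length, which is the cost
of tolerating a caller that holds a stale pointer.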

However, this is probably not the optimal way to fix the problem. Both
recent_rgrp_next() and recent_rgrp_remove() now iterate the same list for
apparently the same reason. Would correct locking be preferable?

The other locking change in the patch (in gfs_ri_update() and clear_rgrpdi()) is
not directly related, but it seems more correct to me that way. You may
have a better understanding of the real situation.

Comment 3 Ken Preslan 2004-11-09 09:06:18 UTC
A fix should be in the CVS head now.

Comment 4 Ken Preslan 2004-11-16 15:15:23 UTC
I haven't heard any objections, so I'm marking this as fixed.

Comment 5 John Flanagan 2004-12-21 15:58:29 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

