Bug 135684

Summary: GFS: Unable to handle kernel paging request (get_local_rgrp)
Product: [Retired] Red Hat Cluster Suite
Reporter: Alexander Laamanen <alexander.laamanen>
Component: gfs
Assignee: Ken Preslan <kpreslan>
Status: CLOSED ERRATA
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium    
Version: 3   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-11-16 15:15:23 UTC
Attachments:
Description            Flags
demonstration patch    none

Description Alexander Laamanen 2004-10-14 13:27:50 UTC
Description of problem:

I got the following crash while running a heavy IMAP (Courier) load. The
load had been running for about 4 hours.

The system:
- 2 nodes (SMP)
- 3TB GFS partition
- lock_dlm in use
- IMAP load only on one node (only one node mounting the filesystem)

Unable to handle kernel paging request at virtual address 00100104
 printing eip:
82b7e4d3
*pde = 00003001
Oops: 0002 [#1]
SMP
Modules linked in: ip_vs_wlc ip_vs rfcomm l2cap bluetooth lock_dlm(U)
dlm(U) cman(U) gfs(U) lock_harness(U) dm_mod md5 ipv6 autofs4 e1000
bonding uhci_hcd button battery asus_acpi ac ext3 jbd raid1 aic79xx
sd_mod scsi_mod
CPU:    0
EIP:    0060:[<82b7e4d3>]    Not tainted
EFLAGS: 00010246   (2.6.8-1.521.rootsmp)
EIP is at recent_rgrp_remove+0x47/0x94 [gfs]
eax: 00100100   ebx: 7d0cec00   ecx: 7d0cec10   edx: 00200200
esi: 82c63000   edi: 1127952c   ebp: 11279400   esp: 18f1ce34
ds: 007b   es: 007b   ss: 0068
Process imapd (pid: 24303, threadinfo=18f1c000 task=51ad8770)
Stack: 00000003 00000003 00000000 00000000 82b7e82a 00000000 00000000
00000000
       00000001 82c63000 34748344 00000000 11279400 34748344 112794f8
82b7e9db
       000002d7 82b86514 00000000 00000000 34748344 00001000 82b74a54
18f1cf18
Call Trace:
 [<82b7e82a>] get_local_rgrp+0xa1/0x1e9 [gfs]
 [<82b7e9db>] gfs_inplace_reserve_i+0x69/0xa1 [gfs]
 [<82b74a54>] do_do_write_buf+0xf2/0x3b0 [gfs]
 [<82b74e10>] do_write_buf+0xfe/0x140 [gfs]
 [<82b7406b>] walk_vm+0xd6/0xfa [gfs]
 [<82b74ef1>] gfs_write+0x9f/0xb8 [gfs]
 [<82b74d12>] do_write_buf+0x0/0x140 [gfs]
 [<0215b2fb>] vfs_write+0xb8/0xe4
 [<0215b3c5>] sys_write+0x3c/0x62
Code: 89 50 04 89 02 c7 41 04 00 02 20 00 8b 93 24 01 00 00 b1 01


Version-Release number of selected component (if applicable):
cvs head 2004-10-14, with bug #135249 fixed.

How reproducible:
At the moment I don't have simple instructions for reproducing the crash.

Comment 1 Alexander Laamanen 2004-10-21 12:46:06 UTC
I have hit this error a few times, but it takes a long time to
reproduce.
One question that may be related: is there a reason why
sd_rg_recent_lock is not held while the sd_rg_recent list is being
cleared in clear_rgrpdi()?


Comment 2 Alexander Laamanen 2004-10-29 12:15:59 UTC
Created attachment 105935 [details]
demonstration patch

Hi,

The bug is really in get_local_rgrp(): the recent_rgrp list is not correctly
locked there.

recent_rgrp_first() returns an rgd. The list is not locked between that call
and the later call to recent_rgrp_remove(), which nevertheless assumes that the
entry has not already been removed from the list.
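
The register dump in the original report appears consistent with this kind of
double removal: eax holds 00100100 and edx holds 00200200, which are the
LIST_POISON1 and LIST_POISON2 values that list_del() in 2.6 kernels stores into
an entry after unlinking it, and the faulting address 00100104 is LIST_POISON1
plus the offset of the prev pointer on i686. The following is a minimal
userspace sketch of that mechanism, assuming the list helpers behave like the
2.6 list.h versions. It is not the GFS source, and first_store_target() is a
hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of the kernel's doubly linked list; not GFS code. */
struct list_head { struct list_head *next, *prev; };

/* Poison values that 2.6 kernels write into an entry in list_del(). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add(struct list_head *n, struct list_head *head)
{
    assert(n != NULL && head != NULL);
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void list_del(struct list_head *e)
{
    e->next->prev = e->prev;   /* a second call stores through LIST_POISON1 here */
    e->prev->next = e->next;
    e->next = LIST_POISON1;
    e->prev = LIST_POISON2;
}

/*
 * Hypothetical helper: the address that the first store in list_del()
 * would write to if list_del() were called on this entry now.
 */
static uintptr_t first_store_target(const struct list_head *e)
{
    return (uintptr_t)e->next + offsetof(struct list_head, prev);
}
```

After one list_del(), first_store_target() on the same entry evaluates to
0x00100100 plus the pointer size, which is 0x00100104 with 4-byte pointers. A
second list_del() on that entry would therefore write to the address shown in
the Oops.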

The attached patch simply iterates the list (inside sd_rg_recent_lock) in
recent_rgrp_remove() and verifies that the entry is in fact still a member of
the list.
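
The idea of the membership check can be sketched in userspace as follows. This
is only an illustration under stated assumptions, not the attached patch: a C11
atomic_flag stands in for the sd_rg_recent_lock spinlock, and recent_add() and
recent_remove_checked() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Stand-in for the sd_rg_recent_lock spinlock (assumption: simple spin). */
static atomic_flag recent_lock = ATOMIC_FLAG_INIT;

static void recent_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&recent_lock, memory_order_acquire)) {
        /* spin until the flag is released */
    }
}

static void recent_lock_release(void)
{
    atomic_flag_clear_explicit(&recent_lock, memory_order_release);
}

/* Stand-in for the sd_rg_recent list head; starts out empty. */
static struct list_head recent = { &recent, &recent };

/* Append an entry to the tail of the list under the lock. */
static void recent_add(struct list_head *e)
{
    assert(e != NULL);
    recent_lock_acquire();
    e->prev = recent.prev;
    e->next = &recent;
    recent.prev->next = e;
    recent.prev = e;
    recent_lock_release();
}

/*
 * Unlink e only if it is still on the list. Returns true if it was found
 * and removed, false if another path had already removed it.
 */
static bool recent_remove_checked(struct list_head *e)
{
    bool found = false;

    recent_lock_acquire();
    for (struct list_head *p = recent.next; p != &recent; p = p->next) {
        if (p == e) {
            p->prev->next = p->next;
            p->next->prev = p->prev;
            p->next = NULL;
            p->prev = NULL;
            found = true;
            break;
        }
    }
    recent_lock_release();
    return found;
}
```

With this shape, a second removal of the same entry returns false instead of
dereferencing stale pointers, at the cost of a linear walk of the list on every
removal.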

However, this is probably not the optimal way to fix the problem. Both
recent_rgrp_next() and recent_rgrp_remove() now iterate the same list,
apparently for the same reason. Correct locking would probably be preferable.

The other locking change in the patch (in gfs_ri_update() and clear_rgrpdi()) is
not directly related, but it seems more correct to me that way. You may
have a better understanding of the real situation.

Comment 3 Ken Preslan 2004-11-09 09:06:18 UTC
A fix should be in the CVS head now.



Comment 4 Ken Preslan 2004-11-16 15:15:23 UTC
I haven't heard any objections, so I'm marking this as fixed.


Comment 5 John Flanagan 2004-12-21 15:58:29 UTC
An advisory has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-602.html