Bug 135684

Summary:

GFS: Unable to handle kernel paging request (get_local_rgrp)

Product:

[Retired] Red Hat Cluster Suite

Reporter:

Alexander Laamanen <alexander.laamanen>

Component:

gfs

Assignee:

Ken Preslan <kpreslan>

Status:

CLOSED ERRATA

QA Contact:

GFS Bugs <gfs-bugs>

Severity:

medium

Priority:

medium

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2004-11-16 15:15:23 UTC

Attachments:

Description	Flags
demonstration patch	none

Description Alexander Laamanen 2004-10-14 13:27:50 UTC

Description of problem:

I got the following crash while running a heavy IMAP (Courier) load. The
load had been running for about 4 hours.

The system:
- 2 nodes (SMP)
- 3TB GFS partition
- lock_dlm in use
- IMAP load only on one node (only one node mounting the filesystem)

Unable to handle kernel paging request at virtual address 00100104
 printing eip:
82b7e4d3
*pde = 00003001
Oops: 0002 [#1]
SMP
Modules linked in: ip_vs_wlc ip_vs rfcomm l2cap bluetooth lock_dlm(U)
dlm(U) cman(U) gfs(U) lock_harness(U) dm_mod md5 ipv6 autofs4 e1000
bonding uhci_hcd button battery asus_acpi ac ext3 jbd raid1 aic79xx
sd_mod scsi_mod
CPU:    0
EIP:    0060:[<82b7e4d3>]    Not tainted
EFLAGS: 00010246   (2.6.8-1.521.rootsmp)
EIP is at recent_rgrp_remove+0x47/0x94 [gfs]
eax: 00100100   ebx: 7d0cec00   ecx: 7d0cec10   edx: 00200200
esi: 82c63000   edi: 1127952c   ebp: 11279400   esp: 18f1ce34
ds: 007b   es: 007b   ss: 0068
Process imapd (pid: 24303, threadinfo=18f1c000 task=51ad8770)
Stack: 00000003 00000003 00000000 00000000 82b7e82a 00000000 00000000 00000000
       00000001 82c63000 34748344 00000000 11279400 34748344 112794f8 82b7e9db
       000002d7 82b86514 00000000 00000000 34748344 00001000 82b74a54 18f1cf18
Call Trace:
 [<82b7e82a>] get_local_rgrp+0xa1/0x1e9 [gfs]
 [<82b7e9db>] gfs_inplace_reserve_i+0x69/0xa1 [gfs]
 [<82b74a54>] do_do_write_buf+0xf2/0x3b0 [gfs]
 [<82b74e10>] do_write_buf+0xfe/0x140 [gfs]
 [<82b7406b>] walk_vm+0xd6/0xfa [gfs]
 [<82b74ef1>] gfs_write+0x9f/0xb8 [gfs]
 [<82b74d12>] do_write_buf+0x0/0x140 [gfs]
 [<0215b2fb>] vfs_write+0xb8/0xe4
 [<0215b3c5>] sys_write+0x3c/0x62
Code: 89 50 04 89 02 c7 41 04 00 02 20 00 8b 93 24 01 00 00 b1 01
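
A note on reading the oops above, offered as a sketch rather than a confirmed diagnosis: the register values match the poison constants that the 2.6-era list_del() writes into a removed entry (LIST_POISON1 = 0x00100100 in next, LIST_POISON2 = 0x00200200 in prev), and the faulting address 00100104 is LIST_POISON1 plus the 4-byte offset of prev on i686. That pattern is what a second list_del() on an already removed entry would produce. The type list_head32 and both helper functions below are hypothetical names used only for illustration; they are not part of GFS or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Poison values stored by list_del() in a removed entry (linux/list.h, 2.6 era). */
#define LIST_POISON1 0x00100100u /* written to entry->next */
#define LIST_POISON2 0x00200200u /* written to entry->prev */

/* Hypothetical model of the i686 layout of struct list_head: two 4-byte pointers. */
struct list_head32 {
    uint32_t next; /* offset 0 */
    uint32_t prev; /* offset 4 */
};

/* Address touched by the first store of list_del(), "next->prev = prev". */
static uint32_t first_store_address(uint32_t next)
{
    return next + (uint32_t)offsetof(struct list_head32, prev);
}

/* Returns 1 if the entry carries both poison values, i.e. it looks as if
 * list_del() has already been called on it. */
static int looks_already_deleted(const struct list_head32 *e)
{
    return e->next == LIST_POISON1 && e->prev == LIST_POISON2;
}
```

With eax = 00100100 as the entry's next pointer and edx = 00200200 as its prev pointer, first_store_address() yields 0x00100104, the address reported in the oops.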


Version-Release number of selected component (if applicable):
cvs head 2004-10-14, with bug #135249 fixed.

How reproducible:
At the moment I don't have simple instructions for reproducing the crash.

Comment 1 Alexander Laamanen 2004-10-21 12:46:06 UTC

I have hit this error a few times, but it takes a long time to
reproduce.
One question that may be related: is there a reason why
sd_rg_recent_lock is not held while the sd_rg_recent list is being
cleared in clear_rgrpdi()?

Comment 2 Alexander Laamanen 2004-10-29 12:15:59 UTC

Created attachment 105935
demonstration patch

Hi,

The bug is really in get_local_rgrp(): the recent_rgrp list is not locked
correctly there.

recent_rgrp_first() returns an rgd, and the list is then left unlocked. When
recent_rgrp_remove() is called later, it still assumes that the entry has not
already been removed from the list.

The attached patch simply iterates the list (inside sd_rg_recent_lock) in
recent_rgrp_remove() and verifies that the entry is in fact still a member of
the list.
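
The membership check described above can be sketched in userspace C as follows. This is a minimal illustration, not the attached patch: the atomic_flag spinlock stands in for sd_rg_recent_lock, and the names recent_list, recent_add() and recent_remove_checked() are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical stand-ins for the sd_rg_recent list and sd_rg_recent_lock. */
static struct list_head recent_list = { &recent_list, &recent_list };
static atomic_flag recent_lock = ATOMIC_FLAG_INIT;

static void recent_lock_acquire(void)
{
    while (atomic_flag_test_and_set(&recent_lock))
        ; /* spin until the flag is released */
}

static void recent_lock_release(void)
{
    atomic_flag_clear(&recent_lock);
}

/* Append an entry to the tail of the list under the lock. */
static void recent_add(struct list_head *n)
{
    recent_lock_acquire();
    n->prev = recent_list.prev;
    n->next = &recent_list;
    recent_list.prev->next = n;
    recent_list.prev = n;
    recent_lock_release();
}

/* Unlink the entry only if it is still found on the list.
 * Returns 1 if it was removed, 0 if it was no longer a member. */
static int recent_remove_checked(struct list_head *entry)
{
    struct list_head *pos;
    int removed = 0;

    recent_lock_acquire();
    for (pos = recent_list.next; pos != &recent_list; pos = pos->next) {
        if (pos == entry) {
            entry->prev->next = entry->next;
            entry->next->prev = entry->prev;
            entry->next = entry->prev = entry; /* self-linked, off the list */
            removed = 1;
            break;
        }
    }
    recent_lock_release();
    return removed;
}
```

A second call to recent_remove_checked() on the same entry walks the list, does not find it, and returns 0 instead of unlinking it again. The cost is a linear scan on every removal.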

However, this is probably not the optimal way to fix the problem. Both
recent_rgrp_next() and recent_rgrp_remove() now iterate the same list for
apparently the same reason. Correct locking would probably be preferable.

The other locking change in the patch (gfs_ri_update() and clear_rgrpdi()) is
not directly related, but it seems more correct to me that way. You may
have a better understanding of the real situation.
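
One possible shape of the "correct locking" alternative is to look up and unlink the entry within a single critical section, so that no other thread can remove it between the two steps. The sketch below is an assumption about how that could look in userspace C. It is not the fix that was committed, and the names node, take_first() and push_back() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; id is only there to tell entries apart. */
struct node {
    struct node *next, *prev;
    int id;
};

static struct node head = { &head, &head, -1 };
static atomic_flag list_lock = ATOMIC_FLAG_INIT; /* stand-in for a spinlock */

static void list_lock_acquire(void)
{
    while (atomic_flag_test_and_set(&list_lock))
        ; /* spin until the flag is released */
}

static void list_lock_release(void)
{
    atomic_flag_clear(&list_lock);
}

/* Append a node to the tail of the list under the lock. */
static void push_back(struct node *n)
{
    list_lock_acquire();
    n->prev = head.prev;
    n->next = &head;
    head.prev->next = n;
    head.prev = n;
    list_lock_release();
}

/* Find the first entry and unlink it without dropping the lock in between.
 * Returns NULL if the list is empty. */
static struct node *take_first(void)
{
    struct node *n = NULL;

    list_lock_acquire();
    if (head.next != &head) {
        n = head.next;
        n->prev->next = n->next;
        n->next->prev = n->prev;
        n->next = n->prev = n; /* self-linked, off the list */
    }
    list_lock_release();
    return n;
}
```

Because the lookup and the unlink happen under one lock acquisition, two callers can never be handed the same entry, and no list scan is needed on removal.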

Comment 3 Ken Preslan 2004-11-09 09:06:18 UTC

A fix should be in the CVS head now.

Comment 4 Ken Preslan 2004-11-16 15:15:23 UTC

I haven't heard any objections, so I'm marking this as fixed.

Comment 5 John Flanagan 2004-12-21 15:58:29 UTC

An advisory has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-602.html