Bug 246114 - GFS2: soft lockup in rgblk_search
Summary: GFS2: soft lockup in rgblk_search
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Don Zickus
QA Contact: Dean Jansa
URL:
Whiteboard:
Keywords:
Depends On: 245832
Blocks: 204760
 
Reported: 2007-06-28 16:33 UTC by Abhijith Das
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 19:54:50 UTC


Attachments
patch to fix the problem--try #1 (724 bytes, patch)
2007-07-12 21:53 UTC, Robert Peterson


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2007:0959 normal SHIPPED_LIVE Updated kernel packages for Red Hat Enterprise Linux 5 Update 1 2007-11-08 00:47:37 UTC

Description Abhijith Das 2007-06-28 16:33:44 UTC
I hit this soft lockup while running revolver on the smoke cluster. The
iteration killed 3 nodes {kool, merit, salem}; winston hit this lockup.

I'm running the cvs RHEL5 kernel with the GFS2 patches posted to rhkernel-list
as of 06/27.

 BUG: soft lockup detected on CPU#0!
 [<c014b4ec>] softlockup_tick+0x98/0xa6
 [<c012be40>] update_process_times+0x39/0x5c
 [<c0116adc>] smp_apic_timer_interrupt+0x5b/0x6b
 [<c0104a3b>] apic_timer_interrupt+0x1f/0x24
 [<e0701058>] rgblk_search+0x7e/0x145 [gfs2]
 [<e0701596>] try_rgrp_unlink+0x19/0x59 [gfs2]
 [<e0702127>] gfs2_inplace_reserve_i+0x244/0x57d [gfs2]
 [<c011f31f>] __cond_resched+0x16/0x34
 [<c0152ae3>] find_get_page+0x18/0x38
 [<e06fdf24>] gfs2_sharewrite_nopage+0x17c/0x296 [gfs2]
 [<e06fddeb>] gfs2_sharewrite_nopage+0x43/0x296 [gfs2]
 [<c02a7a1e>] net_rx_action+0x92/0x175
 [<c015e53e>] __handle_mm_fault+0x17d/0x8c8
 [<c0300269>] do_page_fault+0x23c/0x51c
 [<c030002d>] do_page_fault+0x0/0x51c
 [<c0104ac9>] error_code+0x39/0x40

Comment 1 Steve Whitehouse 2007-07-04 14:14:59 UTC
It would be well worth trying out the patch for bz #245832 to see if that fixes
this problem too.


Comment 2 Kiersten (Kerri) Anderson 2007-07-12 17:40:06 UTC
Has it now been verified that this one is not the same problem as #245832?

Comment 3 Robert Peterson 2007-07-12 21:53:34 UTC
Created attachment 159101
patch to fix the problem--try #1

This patch seems to fix the problem.  It was written by Steve Whitehouse
with some tweaking by me.

The code was looping in the relatively new section of code designed to
search for and reuse unlinked inodes.  In cases where it was finding an
appropriate inode to reuse, it was looping around and finding the same
block over and over because a "<=" check should have been a "<" when
comparing the goal block to the last unlinked block found.
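
To illustrate the failure mode described above, here is a minimal,
self-contained C sketch.  It is not the GFS2 code and not the attached
patch: toy_search, toy_scan, the block states, and the max_iter cap are
all hypothetical names invented for this example.  It only shows how a
"<=" comparison against the last block found can make a resumable search
return the same block forever, while "<" lets the goal move on.

```c
#define NOENT (-1L)

/* Hypothetical block states for the toy bitmap. */
enum { FREE = 0, USED = 1, UNLINKED = 2 };

/* Return the first index >= goal whose state is UNLINKED, or NOENT. */
static long toy_search(const int *map, long n, long goal)
{
    for (long i = goal; i < n; i++)
        if (map[i] == UNLINKED)
            return i;
    return NOENT;
}

/*
 * Repeatedly search for unlinked blocks, resuming from the last hit.
 * strict != 0 uses "last < found" to decide whether a hit is new;
 * strict == 0 uses "last <= found", which accepts the same block again.
 * max_iter is an assumption of this sketch: it caps the loop so the
 * runaway case is observable instead of hanging.
 * Returns the number of search calls made.
 */
static long toy_scan(const int *map, long n, int strict, long max_iter)
{
    long last = -1;  /* last unlinked block handed back */
    long goal = 0;   /* where the next search starts */
    long iters = 0;

    while (iters < max_iter) {
        iters++;
        long found = toy_search(map, n, goal);
        if (found == NOENT)
            break;

        int is_new = strict ? (last < found) : (last <= found);
        if (is_new) {
            last = found;
            goal = found;      /* resume from the hit on the next pass */
        } else {
            goal = found + 1;  /* already seen: step past it */
        }
    }
    return iters;
}
```

With a small map containing two unlinked blocks, the strict variant
finishes after a handful of searches, while the "<=" variant keeps
returning the first unlinked block until the cap is reached.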

Comment 4 Robert Peterson 2007-07-12 22:17:06 UTC
Fix was tested on the roth cluster using QE's revolver test.
I posted the patch to cluster-devel for inclusion in the upstream
kernel and to rhkernel-list for RHEL 5.1.  I'm changing the status to
POST and reassigning to Don Zickus.

Comment 6 Don Zickus 2007-07-20 18:13:22 UTC
Included in 2.6.18-35.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 9 errata-xmlrpc 2007-11-07 19:54:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


