174895 – System became unresponsive to local commands.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.3
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Doug Ledford
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	168429 175195

Reported:	2005-12-03 16:20 UTC by Jeff Burke
Modified:	2007-11-30 22:07 UTC
CC List:	3 users
Fixed In Version:	RHSA-2006-0132
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-03-07 20:55:54 UTC
Target Upstream Version:
Embargoed:

Attachments
netconsole log file (228.94 KB, text/plain) 2005-12-03 16:25 UTC, Jeff Burke

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2006:0132	0	qe-ready	SHIPPED_LIVE	Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3	2006-03-09 16:31:00 UTC

Description Jeff Burke 2005-12-03 16:20:43 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
While stress testing the RHEL4-U3 kernel, the system entered a "hang state".
I could not log in via the console or ssh. I was able to ping the system and to issue Alt+SysRq commands.

While this was occurring, Larry W examined the system. I will attach the /var/log/messages file.

Version-Release number of selected component (if applicable):
kernel-2.6.9-24.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. I believe the system requires a RAID 1 configuration.
2. Run the stress-kernel rpm.

Actual Results:  System was in a "hung state".

Expected Results:  System should not hang indefinitely.

Additional info:

Comment 2 Jeff Burke 2005-12-03 16:25:47 UTC

Created attachment 121797 [details]
netconsole log file

Comment 6 Larry Woodman 2005-12-06 22:27:23 UTC

The problem here is that the raid1 code takes a semaphore and then tries to
allocate memory. The allocation can result in a call to try_to_free_pages(),
which recurses into the raid1 code, tries to take the same semaphore, and
therefore deadlocks:

-------------------------------------------------------------------------------
md1_raid1     D 0000010037e06cd8     0   305      1           306   270 (L-TLB)
000001013fe615b8 0000000000000046 0000000000000246 0000001900000076 
       00000100412547f0 0000000000000076 0000010005645940 0000000000000246 
       00000100bff177f0 0000000000191eda 
Call Trace:
<ffffffff80304263>{__down+147}
<ffffffff801333e9>{default_wake_function+0}
<ffffffff80305c74>{__down_failed+53}
<ffffffff8029ed1c>{.text.lock.md+155}
<ffffffffa00629bb>{:raid1:make_request+622}
<ffffffff8024b316>{generic_make_request+355}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff8024b422>{submit_bio+247}
<ffffffff8017c10c>{bio_alloc+288}
<ffffffff8017a03a>{submit_bh+255}
<ffffffff8017af16>{__block_write_full_page+440}
<ffffffff8017e29b>{blkdev_get_block+0}
<ffffffff8016412b>{shrink_zone+3095}
<ffffffff8013265c>{move_tasks+406}
<ffffffff80164715>{try_to_free_pages+303}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff8015cf1e>{__alloc_pages+596}
<ffffffff8015d0af>{__get_free_pages+11}
<ffffffff80160074>{kmem_getpages+36}
<ffffffff80160809>{cache_alloc_refill+609}
<ffffffff80160547>{kmem_cache_alloc+90}
<ffffffff8015baa1>{mempool_alloc+186}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff80134e1b>{autoremove_wake_function+9}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff8017c008>{bio_alloc+28}
<ffffffff8029997d>{sync_page_io+48}
<ffffffff8029b0a4>{md_update_sb+263}
<ffffffffa0063002>{:raid1:raid1d+0}
<ffffffff8029e12b>{md_handle_safemode+244}
<ffffffffa0063028>{:raid1:raid1d+38}
<ffffffffa0063002>{:raid1:raid1d+0}
<ffffffff8029bb85>{md_thread+392}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff80110e17>{child_rip+8}
<ffffffffa0063002>{:raid1:raid1d+0}
<ffffffff8029b9fd>{md_thread+0}
<ffffffff80110e0f>{child_rip+0}
--------------------------------------------------------------------------------

The fix is simple: change sync_page_io() to pass GFP_NOIO instead of GFP_KERNEL
to the memory allocation so that the recursion cannot occur.

------------------------------------------------------------------------------
--- linux-2.6.9/drivers/md/md.c.orig
+++ linux-2.6.9/drivers/md/md.c
@@ -364,7 +364,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
                   struct page *page, int rw)
 {
-       struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+       struct bio *bio = bio_alloc(GFP_NOIO, 1);
        struct completion event;
        int ret;
                                                                               
                                   
---------------------------------------------------------------------------
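
For illustration only, the recursion can be modeled in user space. The sketch
below is not kernel code and is not part of the patch: every toy_* name and the
two TOY_GFP_* flags are hypothetical stand-ins, and a boolean stands in for the
semaphore so that the deadlock is reported rather than actually hanging.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's allocation flags (not real gfp_t values). */
enum toy_gfp {
    TOY_GFP_NOIO,   /* allocator must not start I/O to reclaim memory */
    TOY_GFP_KERNEL  /* allocator may start I/O to reclaim memory */
};

struct toy_md {
    bool sem_held;       /* models the non-recursive semaphore */
    bool memory_is_low;  /* forces the allocator into reclaim */
    bool deadlocked;     /* set where the real kernel would sleep forever */
};

/* Models the raid1 make_request path, which needs the semaphore. */
static void toy_make_request(struct toy_md *md)
{
    if (md->sem_held) {
        md->deadlocked = true;  /* second down() on a held semaphore */
        return;
    }
    md->sem_held = true;
    /* ... the write would be submitted here ... */
    md->sem_held = false;
}

/* Models try_to_free_pages(): writes a dirty page back to the md device. */
static void toy_reclaim(struct toy_md *md)
{
    toy_make_request(md);
}

/* Models the allocation: only flags that permit I/O may enter reclaim. */
static void toy_alloc(struct toy_md *md, enum toy_gfp flags)
{
    if (md->memory_is_low && flags == TOY_GFP_KERNEL)
        toy_reclaim(md);
}

/* Models the superblock update: allocate while holding the semaphore.
 * Returns true if the modeled call chain would deadlock. */
bool toy_update_superblock(enum toy_gfp flags)
{
    struct toy_md md = { .memory_is_low = true };

    md.sem_held = true;
    toy_alloc(&md, flags);
    md.sem_held = false;
    return md.deadlocked;
}
```

In this model, toy_update_superblock(TOY_GFP_KERNEL) reports a deadlock and
toy_update_superblock(TOY_GFP_NOIO) does not, because the NOIO allocation never
re-enters the request path while the semaphore is held.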

This is not a regression, however; this code has been there since RHEL4 gold.

Larry Woodman

Comment 8 Tim Burke 2005-12-06 22:31:30 UTC

Based on comment #6, removing the "regression" severity.

Comment 10 Larry Woodman 2005-12-07 03:32:35 UTC

Jeff, the kernel with the patch applied is located here:

http://porkchop.redhat.com/beehive/comps/dist/4E-scratch/kernel/2.6.9-24NOIO.EL.lwoodman


Larry

Comment 13 Larry Woodman 2005-12-07 20:18:26 UTC

I sent the attached patch to rhkernel-list and it was ACK'd by Dave Jones
because the same fix is upstream.  So, it looks like this will be fixed as
soon as the patch gets committed.

Larry Woodman

Comment 17 Red Hat Bugzilla 2006-03-07 20:55:54 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html
