Description of problem:
While running RHEL4 stress testing, an x86_64 system occasionally hung. Forced netdumps made the cause leading up to the hangs readily evident.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes, but not always.

Steps to Reproduce:
1. Install a RHEL4 2.6.9-11.16 (or higher) x86_64 kernel.
2. Download and install ltp-full-20050307.tgz.
3. Run the ltpstress.sh script.

Actual results:
The system occasionally hangs.

Expected results:
The system should remain relatively responsive, considering that it is under a stress load.

Additional info:
The problem is in the drivers/md/md.c sync_page_io() function, where GFP_KERNEL is used as an argument to bio_alloc():

static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
		   struct page *page, int rw)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
	struct completion event;
	int ret;
	...

Because of the use of GFP_KERNEL, under tight memory constraints it is possible for the same process to call the drivers/md/md.c md_write_start() function recursively; the second time it is called, it blocks forever in mddev_lock_uninterruptible():

void md_write_start(mddev_t *mddev)
{
	if (!atomic_read(&mddev->writes_pending)) {
		mddev_lock_uninterruptible(mddev);
		if (mddev->in_sync) {
			mddev->in_sync = 0;
			del_timer(&mddev->safemode_timer);
			md_update_sb(mddev);
		}
		atomic_inc(&mddev->writes_pending);
		mddev_unlock(mddev);
	} else
		atomic_inc(&mddev->writes_pending);
}

This can happen when the bio_alloc(GFP_KERNEL) call done in sync_page_io() results in:

(1) page reclamation being performed due to low memory, and
(2) the page selected for reclamation being on the same device as the original page, leading to recursion back into md_write_start().

Here's an example x86_64 trace. While the "pdflush" daemon was doing a page write, it made the first call to md_write_start() at frame #29, took the semaphore via mddev_lock_uninterruptible(), called md_update_sb() in frame #28, which called sync_page_io() in frame #27. The bio_alloc(GFP_KERNEL) call required page reclamation to satisfy the request, eventually leading to the call to make_request() in frame #4, which recursively called md_write_start() in frame #3, where the ".text.lock.md at ffffffff80293b80" reference is the mddev_lock_uninterruptible() call in md_write_start():

crash> bt
PID: 8108  TASK: 100281897f0  CPU: 2  COMMAND: "pdflush"
 #0 [10038ea3058] schedule at ffffffff802f9a4a
 #1 [10038ea3130] __sched_text_start at ffffffff802f8d53
 #2 [10038ea3190] __down_failed at ffffffff802fa32f
 #3 [10038ea31e0] .text.lock.md at ffffffff80293b80
 #4 [10038ea3200] make_request at ffffffffa00619b5
 #5 [10038ea3218] mempool_alloc at ffffffff80156bb0
 #6 [10038ea3220] cache_alloc_debugcheck_after at ffffffff8015bf71
 #7 [10038ea3280] generic_make_request at ffffffff8024079e
 #8 [10038ea32d0] submit_bio at ffffffff802408aa
 #9 [10038ea3300] bio_alloc at ffffffff80177db0
#10 [10038ea3330] submit_bh at ffffffff80175ce2
#11 [10038ea3360] __block_write_full_page at ffffffff80176bb9
#12 [10038ea33b0] shrink_zone at ffffffff8016013b
#13 [10038ea34a0] ip_local_deliver at ffffffff802b8e6c
#14 [10038ea34d0] free_pages_bulk at ffffffff8015769c
#15 [10038ea3660] try_to_free_pages at ffffffff80160725
#16 [10038ea36c0] kmem_cache_alloc at ffffffff8015c114
#17 [10038ea3700] __alloc_pages at ffffffff8015802e
#18 [10038ea3730] recalc_task_prio at ffffffff80130649
#19 [10038ea3770] alloc_page_interleave at ffffffff8016e0bc
#20 [10038ea3780] __get_free_pages at ffffffff801581bf
#21 [10038ea3790] kmem_getpages at ffffffff8015b107
#22 [10038ea37b0] cache_alloc_refill at ffffffff8015c3f8
#23 [10038ea37f0] kmem_cache_alloc at ffffffff8015c0fd
#24 [10038ea3810] mempool_alloc at ffffffff80156be9
#25 [10038ea3840] autoremove_wake_function at ffffffff80133983
#26 [10038ea38a0] bio_alloc at ffffffff80177cac
#27 [10038ea38d0] sync_page_io at ffffffff8028e7e1
#28 [10038ea3930] md_update_sb at ffffffff8028ff08
#29 [10038ea39a0] md_write_start at ffffffff80292e25
#30 [10038ea39c0] make_request at ffffffffa00619b5
#31 [10038ea39d8] mempool_alloc at ffffffff80156bb0
#32 [10038ea39e0] cache_alloc_debugcheck_after at ffffffff8015bf71
#33 [10038ea3a40] generic_make_request at ffffffff8024079e
#34 [10038ea3a90] submit_bio at ffffffff802408aa
#35 [10038ea3ac0] bio_alloc at ffffffff80177db0
#36 [10038ea3af0] submit_bh at ffffffff80175ce2
#37 [10038ea3b20] __block_write_full_page at ffffffff80176bb9
#38 [10038ea3b70] ext3_ordered_writepage at ffffffffa007cb26
#39 [10038ea3ba0] mpage_writepages at ffffffff80192b8d
#40 [10038ea3c80] thread_return at ffffffff802f9a74
#41 [10038ea3d20] del_timer at ffffffff8013dcbb
#42 [10038ea3d40] del_singleshot_timer_sync at ffffffff8013dd78
#43 [10038ea3d50] schedule_timeout at ffffffff802fa0d7
#44 [10038ea3db0] __writeback_single_inode at ffffffff801919d5
#45 [10038ea3df0] sync_sb_inodes at ffffffff80192049
#46 [10038ea3e30] writeback_inodes at ffffffff801922e0
#47 [10038ea3e50] background_writeout at ffffffff80159280
#48 [10038ea3ed0] pdflush at ffffffff80159db0
#49 [10038ea3f20] kthread at ffffffff80149003
#50 [10038ea3f50] kernel_thread at ffffffff80110c8f

Eventually things start backing up, waiting for the pdflush thread to complete its requested task.

Note that the bio_alloc(GFP_KERNEL) call was added between 2.6.10 and 2.6.11, and is included in the RHEL4 "linux-2.6.10-ac-selected-bits.patch" patch.
If the argument were GFP_NOIO, the recursion would be avoided because the page would not be selected for writeback:

--- linux-2.6.9/drivers/md/md.c.orig
+++ linux-2.6.9/drivers/md/md.c
@@ -364,7 +364,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 		   struct page *page, int rw)
 {
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);
 	struct completion event;
 	int ret;

We used the less restrictive GFP_NOFS and never saw the hang after that, but it appears that it's still possible for a make_request() to be done recursively to the same device using GFP_NOFS, and that GFP_NOIO would be safer.
Update from a couple of days ago: Jeff Burke gathered the sysrq-t output from another 2.6.9-16 deadlock, in which dozens of tasks were blocked in I/O transactions in mddev_lock_uninterruptible(). There's no vmcore for this one, but the *recursive* mddev_lock_uninterruptible() caller appears to be the md1_raid1 task, which first takes the semaphore in md_handle_safemode(), and then subsequently, via the sync_page_io() route, calls bio_alloc(GFP_KERNEL), leading to the deadlocking mddev_lock_uninterruptible() call:

md1_raid1     D 0000010037e108d8     0   273      1           274   250 (L-TLB)
000001013f781588 0000000000000046 7974742f7665642f 0000001900000073
000001000a3167f0 0000000000000073 000001000565d840 0000000300000246
000001013f64c7f0 0000000000001eec
Call Trace:
<ffffffff80178420>{__find_get_block_slow+255}
<ffffffff80302183>{__down+147}
<ffffffff80132ef9>{default_wake_function+0}
<ffffffff8030375f>{__down_failed+53}
<ffffffff8029ce28>{.text.lock.md+155}  <= mddev_lock_uninterruptible()
<ffffffffa00629b5>{:raid1:make_request+622}
<ffffffff8024959e>{generic_make_request+355}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff802496aa>{submit_bio+247}
<ffffffff8017b224>{bio_alloc+288}
<ffffffff80179156>{submit_bh+255}
<ffffffff8017a02d>{__block_write_full_page+440}
<ffffffffa007f40a>{:ext3:ext3_get_block+0}
<ffffffffa007db46>{:ext3:ext3_ordered_writepage+245}
<ffffffff80163553>{shrink_zone+3095}
<ffffffff801cdf24>{avc_has_perm+70}
<ffffffff80163b3d>{try_to_free_pages+303}
<ffffffff8015c4c6>{__alloc_pages+596}
<ffffffff8015c657>{__get_free_pages+11}
<ffffffff8015f4ec>{kmem_getpages+36}
<ffffffff8015fc81>{cache_alloc_refill+609}
<ffffffff8015f9bf>{kmem_cache_alloc+90}
<ffffffff8015b049>{mempool_alloc+186}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff80134797>{autoremove_wake_function+9}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff8017b120>{bio_alloc+28}
<ffffffff80297a89>{sync_page_io+48}
<ffffffff802991b0>{md_update_sb+263}
<ffffffff80131cd9>{finish_task_switch+55}
<ffffffffa0062ffc>{:raid1:raid1d+0}
<ffffffff8029c237>{md_handle_safemode+244}  <= mddev_lock_uninterruptible()
<ffffffffa0063022>{:raid1:raid1d+38}
<ffffffff80302ee8>{thread_return+110}
<ffffffffa0062ffc>{:raid1:raid1d+0}
<ffffffff80299c91>{md_thread+392}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff80131cd9>{finish_task_switch+55}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff80131d28>{schedule_tail+11}
<ffffffff80110ca3>{child_rip+8}
<ffffffffa0062ffc>{:raid1:raid1d+0}
<ffffffff80299b09>{md_thread+0}
<ffffffff80110c9b>{child_rip+0}
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html