Description of problem:
While running RHEL4 stress testing, an x86_64 system occasionally hung. Forced netdumps made the cause leading up to the hangs readily evident.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes, but not always.

Steps to Reproduce:
1. Install a RHEL4 2.6.9-11.16 (or higher) x86_64 kernel.
2. Download and install ltp-full-20050307.tgz.
3. Run the ltpstress.sh script.

Actual results:
The system occasionally hangs.

Expected results:
The system should remain relatively responsive, considering that it is under a stress load.

Additional info:
The problem is in the drivers/md/md.c sync_page_io() function, where GFP_KERNEL is used as an argument to bio_alloc():

static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
		   struct page *page, int rw)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
	struct completion event;
	int ret;
	...

Because of the use of GFP_KERNEL, under tight memory constraints it is possible for the same process to call the drivers/md/md.c md_write_start() function recursively; the second time it is called, it blocks forever in mddev_lock_uninterruptible():

void md_write_start(mddev_t *mddev)
{
	if (!atomic_read(&mddev->writes_pending)) {
		mddev_lock_uninterruptible(mddev);
		if (mddev->in_sync) {
			mddev->in_sync = 0;
			del_timer(&mddev->safemode_timer);
			md_update_sb(mddev);
		}
		atomic_inc(&mddev->writes_pending);
		mddev_unlock(mddev);
	} else
		atomic_inc(&mddev->writes_pending);
}

This can happen when the bio_alloc(GFP_KERNEL) call done in sync_page_io() results in:

(1) page reclamation being performed due to low memory, and
(2) the page selected for reclamation being on the same device as the original page, leading to recursion back into md_write_start().

Here's an example x86_64 trace. While the "pdflush" daemon was doing a page write, it made the first call to md_write_start() at frame #29, took the semaphore via mddev_lock_uninterruptible(), called md_update_sb() in frame #28, which called sync_page_io() in frame #27. The bio_alloc(GFP_KERNEL) call required page reclamation to satisfy the request, eventually leading to the call to make_request() in frame #4, which recursively called md_write_start() in frame #3, where the ".text.lock.md at ffffffff80293b80" reference is the mddev_lock_uninterruptible() call in md_write_start():

crash> bt
PID: 8108  TASK: 100281897f0  CPU: 2  COMMAND: "pdflush"
 #0 [10038ea3058] schedule at ffffffff802f9a4a
 #1 [10038ea3130] __sched_text_start at ffffffff802f8d53
 #2 [10038ea3190] __down_failed at ffffffff802fa32f
 #3 [10038ea31e0] .text.lock.md at ffffffff80293b80
 #4 [10038ea3200] make_request at ffffffffa00619b5
 #5 [10038ea3218] mempool_alloc at ffffffff80156bb0
 #6 [10038ea3220] cache_alloc_debugcheck_after at ffffffff8015bf71
 #7 [10038ea3280] generic_make_request at ffffffff8024079e
 #8 [10038ea32d0] submit_bio at ffffffff802408aa
 #9 [10038ea3300] bio_alloc at ffffffff80177db0
#10 [10038ea3330] submit_bh at ffffffff80175ce2
#11 [10038ea3360] __block_write_full_page at ffffffff80176bb9
#12 [10038ea33b0] shrink_zone at ffffffff8016013b
#13 [10038ea34a0] ip_local_deliver at ffffffff802b8e6c
#14 [10038ea34d0] free_pages_bulk at ffffffff8015769c
#15 [10038ea3660] try_to_free_pages at ffffffff80160725
#16 [10038ea36c0] kmem_cache_alloc at ffffffff8015c114
#17 [10038ea3700] __alloc_pages at ffffffff8015802e
#18 [10038ea3730] recalc_task_prio at ffffffff80130649
#19 [10038ea3770] alloc_page_interleave at ffffffff8016e0bc
#20 [10038ea3780] __get_free_pages at ffffffff801581bf
#21 [10038ea3790] kmem_getpages at ffffffff8015b107
#22 [10038ea37b0] cache_alloc_refill at ffffffff8015c3f8
#23 [10038ea37f0] kmem_cache_alloc at ffffffff8015c0fd
#24 [10038ea3810] mempool_alloc at ffffffff80156be9
#25 [10038ea3840] autoremove_wake_function at ffffffff80133983
#26 [10038ea38a0] bio_alloc at ffffffff80177cac
#27 [10038ea38d0] sync_page_io at ffffffff8028e7e1
#28 [10038ea3930] md_update_sb at ffffffff8028ff08
#29 [10038ea39a0] md_write_start at ffffffff80292e25
#30 [10038ea39c0] make_request at ffffffffa00619b5
#31 [10038ea39d8] mempool_alloc at ffffffff80156bb0
#32 [10038ea39e0] cache_alloc_debugcheck_after at ffffffff8015bf71
#33 [10038ea3a40] generic_make_request at ffffffff8024079e
#34 [10038ea3a90] submit_bio at ffffffff802408aa
#35 [10038ea3ac0] bio_alloc at ffffffff80177db0
#36 [10038ea3af0] submit_bh at ffffffff80175ce2
#37 [10038ea3b20] __block_write_full_page at ffffffff80176bb9
#38 [10038ea3b70] ext3_ordered_writepage at ffffffffa007cb26
#39 [10038ea3ba0] mpage_writepages at ffffffff80192b8d
#40 [10038ea3c80] thread_return at ffffffff802f9a74
#41 [10038ea3d20] del_timer at ffffffff8013dcbb
#42 [10038ea3d40] del_singleshot_timer_sync at ffffffff8013dd78
#43 [10038ea3d50] schedule_timeout at ffffffff802fa0d7
#44 [10038ea3db0] __writeback_single_inode at ffffffff801919d5
#45 [10038ea3df0] sync_sb_inodes at ffffffff80192049
#46 [10038ea3e30] writeback_inodes at ffffffff801922e0
#47 [10038ea3e50] background_writeout at ffffffff80159280
#48 [10038ea3ed0] pdflush at ffffffff80159db0
#49 [10038ea3f20] kthread at ffffffff80149003
#50 [10038ea3f50] kernel_thread at ffffffff80110c8f

Eventually things start backing up, waiting for the pdflush thread to complete its requested task.

Note that the bio_alloc(GFP_KERNEL) call was added between 2.6.10 and 2.6.11, and is included in the RHEL4 "linux-2.6.10-ac-selected-bits.patch" patch.
If the argument were GFP_NOIO, the recursion would be avoided because the page would not be selected for writeback:

--- linux-2.6.9/drivers/md/md.c.orig
+++ linux-2.6.9/drivers/md/md.c
@@ -364,7 +364,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 		   struct page *page, int rw)
 {
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);
 	struct completion event;
 	int ret;

We used the less restrictive GFP_NOFS and never saw the hang after that, but it appears that it's still possible for a make_request() to be done recursively to the same device using GFP_NOFS, and that GFP_NOIO would be safer.
Update from a couple of days ago: Jeff Burke gathered the sysrq-t output from another 2.6.9-16 deadlock, in which dozens of tasks were blocked in I/O transactions in mddev_lock_uninterruptible(). There's no vmcore for this one, but the *recursive* mddev_lock_uninterruptible() caller appears to be the md1_raid1 task, which first takes the semaphore in md_handle_safemode(), and then subsequently, via the sync_page_io() route, calls bio_alloc(GFP_KERNEL), leading to the deadlocking mddev_lock_uninterruptible() call:

md1_raid1     D 0000010037e108d8     0   273      1           274   250 (L-TLB)
000001013f781588 0000000000000046 7974742f7665642f 0000001900000073
000001000a3167f0 0000000000000073 000001000565d840 0000000300000246
000001013f64c7f0 0000000000001eec
Call Trace:
<ffffffff80178420>{__find_get_block_slow+255}
<ffffffff80302183>{__down+147}
<ffffffff80132ef9>{default_wake_function+0}
<ffffffff8030375f>{__down_failed+53}
<ffffffff8029ce28>{.text.lock.md+155}  <= mddev_lock_uninterruptible()
<ffffffffa00629b5>{:raid1:make_request+622}
<ffffffff8024959e>{generic_make_request+355}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff802496aa>{submit_bio+247}
<ffffffff8017b224>{bio_alloc+288}
<ffffffff80179156>{submit_bh+255}
<ffffffff8017a02d>{__block_write_full_page+440}
<ffffffffa007f40a>{:ext3:ext3_get_block+0}
<ffffffffa007db46>{:ext3:ext3_ordered_writepage+245}
<ffffffff80163553>{shrink_zone+3095}
<ffffffff801cdf24>{avc_has_perm+70}
<ffffffff80163b3d>{try_to_free_pages+303}
<ffffffff8015c4c6>{__alloc_pages+596}
<ffffffff8015c657>{__get_free_pages+11}
<ffffffff8015f4ec>{kmem_getpages+36}
<ffffffff8015fc81>{cache_alloc_refill+609}
<ffffffff8015f9bf>{kmem_cache_alloc+90}
<ffffffff8015b049>{mempool_alloc+186}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff80134797>{autoremove_wake_function+9}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff8017b120>{bio_alloc+28}
<ffffffff80297a89>{sync_page_io+48}
<ffffffff802991b0>{md_update_sb+263}
<ffffffff80131cd9>{finish_task_switch+55}
<ffffffffa0062ffc>{:raid1:raid1d+0}
<ffffffff8029c237>{md_handle_safemode+244}  <= mddev_lock_uninterruptible()
<ffffffffa0063022>{:raid1:raid1d+38}
<ffffffff80302ee8>{thread_return+110}
<ffffffffa0062ffc>{:raid1:raid1d+0}
<ffffffff80299c91>{md_thread+392}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff80131cd9>{finish_task_switch+55}
<ffffffff8013478e>{autoremove_wake_function+0}
<ffffffff80131d28>{schedule_tail+11}
<ffffffff80110ca3>{child_rip+8}
<ffffffffa0062ffc>{:raid1:raid1d+0}
<ffffffff80299b09>{md_thread+0}
<ffffffff80110c9b>{child_rip+0}
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html