Bug 174895
Summary: | System became unresponsive to local commands. | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Jeff Burke <jburke> |
Component: | kernel | Assignee: | Doug Ledford <dledford> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Priority: | medium |
Version: | 4.3 | CC: | dledford, jbaron, lwoodman |
Hardware: | x86_64 | OS: | Linux |
Keywords: | Regression | Doc Type: | Bug Fix |
Fixed In Version: | RHSA-2006-0132 | Last Closed: | 2006-03-07 20:55:54 UTC |
Bug Blocks: | 168429, 175195 | | |
Description
Jeff Burke
2005-12-03 16:20:43 UTC
Created attachment 121797: netconsole log file
The problem here is that the raid1 code takes a semaphore and then tries to allocate memory. The allocation can result in a call to try_to_free_pages(), which recurses back into the raid1 code, tries to take the same semaphore, and therefore deadlocks:

-------------------------------------------------------------------------------
md1_raid1     D 0000010037e06cd8     0   305      1           306   270 (L-TLB)
000001013fe615b8 0000000000000046 0000000000000246 0000001900000076
00000100412547f0 0000000000000076 0000010005645940 0000000000000246
00000100bff177f0 0000000000191eda
Call Trace:
<ffffffff80304263>{__down+147}
<ffffffff801333e9>{default_wake_function+0}
<ffffffff80305c74>{__down_failed+53}
<ffffffff8029ed1c>{.text.lock.md+155}
<ffffffffa00629bb>{:raid1:make_request+622}
<ffffffff8024b316>{generic_make_request+355}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff8024b422>{submit_bio+247}
<ffffffff8017c10c>{bio_alloc+288}
<ffffffff8017a03a>{submit_bh+255}
<ffffffff8017af16>{__block_write_full_page+440}
<ffffffff8017e29b>{blkdev_get_block+0}
<ffffffff8016412b>{shrink_zone+3095}
<ffffffff8013265c>{move_tasks+406}
<ffffffff80164715>{try_to_free_pages+303}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff8015cf1e>{__alloc_pages+596}
<ffffffff8015d0af>{__get_free_pages+11}
<ffffffff80160074>{kmem_getpages+36}
<ffffffff80160809>{cache_alloc_refill+609}
<ffffffff80160547>{kmem_cache_alloc+90}
<ffffffff8015baa1>{mempool_alloc+186}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff80134e1b>{autoremove_wake_function+9}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff8017c008>{bio_alloc+28}
<ffffffff8029997d>{sync_page_io+48}
<ffffffff8029b0a4>{md_update_sb+263}
<ffffffffa0063002>{:raid1:raid1d+0}
<ffffffff8029e12b>{md_handle_safemode+244}
<ffffffffa0063028>{:raid1:raid1d+38}
<ffffffffa0063002>{:raid1:raid1d+0}
<ffffffff8029bb85>{md_thread+392}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff80134e12>{autoremove_wake_function+0}
<ffffffff80110e17>{child_rip+8}
<ffffffffa0063002>{:raid1:raid1d+0}
<ffffffff8029b9fd>{md_thread+0}
<ffffffff80110e0f>{child_rip+0}
-------------------------------------------------------------------------------

The fix is simple: change sync_page_io() to pass GFP_NOIO instead of GFP_KERNEL to the memory allocation so that the recursion cannot occur.

-------------------------------------------------------------------------------
--- linux-2.6.9/drivers/md/md.c.orig
+++ linux-2.6.9/drivers/md/md.c
@@ -364,7 +364,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 		   struct page *page, int rw)
 {
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);
 	struct completion event;
 	int ret;
-------------------------------------------------------------------------------

This is not a regression, however; this code has been there since RHEL4 gold.

Larry Woodman

Based on comment #6, removing the "regression" severity.

Jeff, the kernel with the patch applied is located here:

http://porkchop.redhat.com/beehive/comps/dist/4E-scratch/kernel/2.6.9-24NOIO.EL.lwoodman

Larry

I sent the attached patch to rhkernel-list and it was ACK'd by Dave Jones because the same fix is upstream. So it looks like this will be fixed as soon as the patch gets committed.

Larry Woodman

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html
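For readers who want to see the shape of the deadlock without reading the trace, here is a rough userspace model of it. This is only a sketch and not kernel code: every name in it (ALLOC_MAY_DO_IO, ALLOC_NO_IO, md_lock_try, update_superblock_model, and so on) is hypothetical, the "semaphore" is a plain flag, and a failed lock attempt stands in for the point where the real md1_raid1 thread would sleep forever in __down(). It assumes memory is always tight, so every allocation that is allowed to do I/O enters reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's GFP_KERNEL and GFP_NOIO. */
enum alloc_flags {
    ALLOC_MAY_DO_IO, /* like GFP_KERNEL: reclaim may start block I/O */
    ALLOC_NO_IO      /* like GFP_NOIO: reclaim must not start block I/O */
};

/* Models the non-recursive semaphore taken on the md superblock path. */
static bool md_lock_held = false;

/* Assumption for this sketch: the system is always under memory pressure. */
static bool memory_is_tight = true;

/* Returns false where the real code would block forever on itself. */
static bool md_lock_try(void)
{
    if (md_lock_held)
        return false;
    md_lock_held = true;
    return true;
}

static void md_lock_release(void)
{
    assert(md_lock_held);
    md_lock_held = false;
}

/* Models reclaim writing a dirty page, which ends up in raid1's
 * make_request and needs the same lock. */
static bool writeback_dirty_page(void)
{
    if (!md_lock_try())
        return false; /* self-deadlock in the real kernel */
    md_lock_release();
    return true;
}

/* Models bio_alloc(): under pressure, an allocation that may do I/O
 * can recurse into writeback; a no-I/O allocation cannot. */
static bool alloc_bio_model(enum alloc_flags flags)
{
    if (memory_is_tight && flags == ALLOC_MAY_DO_IO)
        return writeback_dirty_page();
    return true;
}

/* Models md_update_sb() -> sync_page_io(): the lock is held across
 * the allocation. Returns true if the update completes. */
bool update_superblock_model(enum alloc_flags flags)
{
    bool ok;

    if (!md_lock_try())
        return false;
    ok = alloc_bio_model(flags);
    md_lock_release();
    return ok;
}
```

In this model, update_superblock_model(ALLOC_MAY_DO_IO) reports failure because the allocation re-enters the code that needs the lock the caller already holds, while update_superblock_model(ALLOC_NO_IO) completes. That mirrors why switching the bio_alloc() call in sync_page_io() to GFP_NOIO breaks the cycle.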