Bug 540811
Summary: | [RHEL5 Xen]: PV guest crash on poweroff | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Chris Lalancette <clalance> | |
Component: | kernel-xen | Assignee: | Chris Lalancette <clalance> | |
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> | |
Severity: | medium | Docs Contact: | ||
Priority: | low | |||
Version: | 5.6 | CC: | ijc, qcai, xen-maint | |
Target Milestone: | rc | Keywords: | Regression | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 541538 | Environment: | |
Last Closed: | 2010-03-30 07:40:50 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 526946 |
Description
Chris Lalancette
2009-11-24 08:45:20 UTC
Additional notes:

Using the steps above and the -164 kernel inside the guest, the crash seems to be reproducible. Using the -128 kernel inside the guest, it is *not* reproducible, so it's a regression between 5.3 and 5.4.

Chris Lalancette

Bizarrely, after bisecting this, it's this commit that causes the problem:

commit 911d74df73a60067a0d4f31f364e521077a8854c
Author: Chris Lalancette <clalance>
Date: Thu Mar 5 14:13:05 2009 +0100

[xen] xen reports bogus LowTotal

Message-id: 49AFCFE1.9050501
O-Subject: [RHEL5.4 PATCH]: Xen reports bogus LowTotal
Bugzilla: 428892
RH-Acked-by: Don Dutile <ddutile>
RH-Acked-by: Rik van Riel <riel>

All,
The xen kernel can report a LowTotal of 4Tb on a system, even though the system only has 3.5Gb of memory. That's obviously totally bogus. The problem is that the balloon driver wasn't properly accounting for totalhigh_pages in its calculations, which screws up the rest of the reporting in the system. This is a straightforward backport of linux-2.6.18-xen.hg c/s 79 and 128, and seems to fix the problem for the reporter. This will fix BZ 428892. Please review and ACK

--
Chris Lalancette

diff --git a/drivers/xen/balloon/balloon.c b/drivers/xen/balloon/balloon.c
index 39d7185..e8ce44f 100644
--- a/drivers/xen/balloon/balloon.c
+++ b/drivers/xen/balloon/balloon.c
@@ -93,6 +93,15 @@ static unsigned long frame_list[PAGE_SIZE / sizeof(unsigned l
 /* VM /proc information for memory */
 extern unsigned long totalram_pages;
 
+#ifndef MODULE
+extern unsigned long totalhigh_pages;
+#define inc_totalhigh_pages() (totalhigh_pages++)
+#define dec_totalhigh_pages() (totalhigh_pages--)
+#else
+#define inc_totalhigh_pages() ((void)0)
+#define dec_totalhigh_pages() ((void)0)
+#endif
+
 /* We may hit the hard limit in Xen. If we do then we remember it. */
 static unsigned long hard_limit;
 
@@ -137,6 +146,7 @@ static void balloon_append(struct page *page)
 	if (PageHighMem(page)) {
 		list_add_tail(PAGE_TO_LIST(page), &ballooned_pages);
 		balloon_high++;
+		dec_totalhigh_pages();
 	} else {
 		list_add(PAGE_TO_LIST(page), &ballooned_pages);
 		balloon_low++;
@@ -154,8 +164,10 @@ static struct page *balloon_retrieve(void)
 	page = LIST_TO_PAGE(ballooned_pages.next);
 	UNLIST_PAGE(page);
 
-	if (PageHighMem(page))
+	if (PageHighMem(page)) {
 		balloon_high--;
+		inc_totalhigh_pages();
+	}
 	else
 		balloon_low--;

Reverting that commit, and only that commit, makes the problem go away. However, the crash really doesn't have anything directly to do with totalhigh_pages. My analysis of the crash so far is:

mm/mempool.c:mempool_alloc() crashes at line 220, accessing 00000010. That means at that line, pool is NULL, and it's trying to access NULL->pool_data.

Going back further in the stack, mempool_alloc() is being called from mm/highmem.c:__blk_queue_bounce(), line 409. The NULL pool is just passed into there.

mm/highmem.c:blk_queue_bounce() is the one that actually figures out the pool. However, this is quite strange; the pool is set to one of two static pools, either isa_page_pool or page_pool. There is a BUG_ON(!isa_page_pool), so we are probably not going through that path. However, page_pool should *never* be NULL; it is initialized early on during boot, and is never changed.

So this leads us to one of two things: either page_pool is being initialized and is later being clobbered (memory corruption), or we should never come into that path with Xen (which I'm just not sure about).

Chris Lalancette

Got it. We are missing upstream linux-2.6.18-xen.hg c/s 148:

# HG changeset patch
# User Ian Campbell <ian.campbell>
# Date 1185543936 -3600
# Node ID 667228bf8fc5f1a21719e11c7eb269d0188a2d60
# Parent 88a17da7f3362126182423100a9d7d4c0d854139
BLKFRONT: Make sure we don't use bounce buffers, we don't need them.

Signed-off-by: Ian Campbell <ian.campbell>

diff -r 88a17da7f336 -r 667228bf8fc5 drivers/xen/blkfront/vbd.c
--- a/drivers/xen/blkfront/vbd.c Thu Jul 26 16:36:52 2007 +0100
+++ b/drivers/xen/blkfront/vbd.c Fri Jul 27 14:45:36 2007 +0100
@@ -213,6 +213,9 @@
 	/* Make sure buffer addresses are sector-aligned. */
 	blk_queue_dma_alignment(rq, 511);
 
+	/* Make sure we don't use bounce buffers. */
+	blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
+
 	gd->queue = rq;
 
 	return 0;

With this in place, my reproducer in the summary works just fine. I'll get this ready for inclusion.

Chris Lalancette

in kernel-2.6.18-177.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
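
For readers following the crash analysis above, here is a minimal userspace sketch in C of the two points it relies on: why a NULL pool faults at address 00000010, and why raising the bounce limit keeps blkfront out of the bounce path altogether. It is a simplified model, not the kernel source. The `model_*` names, the 32-bit non-debug field layout of the pool structure, and the pfn values in the usage note are assumptions made for illustration. The model also does not try to explain why `page_pool` was NULL in the guest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: a 32-bit, non-debug layout of the kernel's mempool_t,
 * modelled with fixed-width fields so the offsets are the same on any host.
 */
struct model_mempool {
    uint32_t lock;      /* stands in for a 4-byte spinlock_t */
    int32_t  min_nr;
    int32_t  curr_nr;
    uint32_t elements;  /* 32-bit pointer */
    uint32_t pool_data; /* 32-bit pointer, the field the analysis mentions */
};

/* Reading pool->pool_data through a NULL pool touches address 0x10. */
_Static_assert(offsetof(struct model_mempool, pool_data) == 0x10,
               "pool_data sits at offset 0x10 in this model");

/* Hypothetical stand-in for the parts of the request queue that matter. */
struct model_queue {
    unsigned long bounce_pfn; /* highest pfn usable without bouncing */
    int uses_isa_dma;         /* nonzero models the ISA (GFP_DMA) case */
};

enum bounce_outcome {
    BOUNCE_SKIPPED,   /* returned early, no pool was looked at */
    BOUNCE_FROM_POOL, /* a valid pool would be handed to mempool_alloc() */
    BOUNCE_NULL_POOL  /* a NULL pool would be handed on: the reported crash */
};

/* Models the pool selection that the analysis attributes to blk_queue_bounce(). */
enum bounce_outcome model_blk_queue_bounce(const struct model_queue *q,
                                           unsigned long max_pfn,
                                           const struct model_mempool *page_pool,
                                           const struct model_mempool *isa_page_pool)
{
    const struct model_mempool *pool;

    if (!q->uses_isa_dma) {
        /* Every page is already reachable, so nothing needs bouncing. */
        if (q->bounce_pfn >= max_pfn)
            return BOUNCE_SKIPPED;
        pool = page_pool;
    } else {
        pool = isa_page_pool;
    }
    return pool ? BOUNCE_FROM_POOL : BOUNCE_NULL_POOL;
}

/* Models the effect of blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY). */
void model_bounce_limit_any(struct model_queue *q, unsigned long max_pfn)
{
    q->bounce_pfn = max_pfn;
    q->uses_isa_dma = 0;
}
```

As a usage illustration with made-up numbers: a queue whose `bounce_pfn` is 1000 on a system with `max_pfn` of 4000 yields `BOUNCE_NULL_POOL` when `page_pool` is NULL, which corresponds to the reported crash. After `model_bounce_limit_any()` the same call yields `BOUNCE_SKIPPED`, so the pool pointer is never consulted.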