Bug 622413 - "xm save" fails for rhel5.5 pv guest on ia64 platform when maxmem > memory
Summary: "xm save" fails for rhel5.5 pv guest on ia64 platform when maxmem > memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.6
Hardware: ia64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Miroslav Rezanina
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 514499 634502
 
Reported: 2010-08-09 09:51 UTC by XinSun
Modified: 2011-01-13 22:23 UTC
CC List: 8 users

Fixed In Version: xen-3.0.3-117.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 634502 (view as bug list)
Environment:
Last Closed: 2011-01-13 22:23:35 UTC
Target Upstream Version:
Embargoed:


Attachments
rhel5u5-ia64-pv (398 bytes, application/octet-stream) - 2010-08-09 09:51 UTC, XinSun
xend.log (15.32 KB, text/plain) - 2010-08-09 09:52 UTC, XinSun
xm_dmesg.txt (16.00 KB, text/plain) - 2010-08-11 08:12 UTC, XinSun
xm_dmesg_hvm.txt (16.00 KB, text/plain) - 2010-08-13 07:45 UTC, XinSun


Links
Red Hat Product Errata RHBA-2011:0031 (normal, SHIPPED_LIVE): xen bug fix and enhancement update - last updated 2011-01-12 15:59:24 UTC

Description XinSun 2010-08-09 09:51:08 UTC
Created attachment 437556 [details]
rhel5u5-ia64-pv

Description of problem:
On the ia64 platform, saving a rhel5.5 pv guest with "xm save" fails when its maxmem > memory. If we set maxmem = memory, "xm save" succeeds.

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5
kernel-xen-2.6.18-211.el5

How reproducible:
Always

Steps to Reproduce:
1.Edit a rhel5.5 pv guest config so that maxmem > memory (an illustrative config sketch follows these steps)
2.Create the rhel5.5 pv guest:
  [host]# xm create rhel5u5-ia64-pv
3.Do "xm save" for this rhel5.5 pv guest
  [host]# xm save rhel5u5-ia64-pv save_rhel
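
For reference, a minimal guest config with maxmem > memory looks like the sketch below. This is an illustrative sketch only, not the contents of the attached rhel5u5-ia64-pv file; the bootloader, disk and vif entries are placeholders, while the memory/maxmem pair matches the values used later in this report.

  # Illustrative sketch only - not the attached rhel5u5-ia64-pv config.
  # Paths and device entries are placeholders; the key point is maxmem > memory.
  name       = "rhel5u5-ia64-pv"
  memory     = 512          # initial allocation in MB
  maxmem     = 1024         # maximum allocation in MB (larger than 'memory')
  bootloader = "/usr/bin/pygrub"
  disk       = [ "file:/var/lib/xen/images/rhel5u5-ia64-pv.img,xvda,w" ]
  vif        = [ "bridge=xenbr0" ]
  on_reboot  = "restart"
  on_crash   = "restart"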
  

Actual results:
"xm save" will fail:
[host]# xm save rhel5u5-ia64-pv save_rhel
Error: /usr/lib/xen/bin/xc_save 24 62 0 0 0 failed
Usage: xm save <Domain> <CheckpointFile>
Save a domain state to restore later.


Expected results:
"xm save" will success, and the save file will be generated.

Additional info:
Please see the attachments: rhel5u5-ia64-pv
                           xend.log

Comment 1 XinSun 2010-08-09 09:52:22 UTC
Created attachment 437557 [details]
xend.log

Attachment xend.log

Comment 2 Miroslav Rezanina 2010-08-09 11:06:18 UTC
Hi XinSun, can you please try this scenario with the xen-3.0.3-105.el5 version? We made some changes to maxmem handling and I'd like to find out whether this is related to them.

Comment 3 XinSun 2010-08-10 07:28:06 UTC
(In reply to comment #2)
> Hi XinSun, can you please try this scenario with the xen-3.0.3-105.el5
> version? We made some changes to maxmem handling and I'd like to find out
> whether this is related to them.

I tried this scenario with xen-3.0.3-105.el5 on the ia64 platform again. With the -105 build, "xm save" succeeds when maxmem > memory (maxmem=1024, memory=512).

Comment 4 Andrew Jones 2010-08-10 08:16:16 UTC
With -105 does it both save and restore correctly? And in the case of -115 with maxmem == memory, does it both save and restore correctly? And when doing the test where maxmem == memory, are you using 1024 or 512? (Just to make sure that the maximum amount of memory is working, we should be using 1024.)

There were a handful of changes that touched maxmem between -105 and -115. Perhaps the most suspicious for this problem is the last one though, since it's on the restore path. Could you please try -114?

Also, when you reproduce the problem, could you please capture 'xm dmesg'.

Thanks,
Drew

Comment 6 XinSun 2010-08-11 08:10:51 UTC
- Tried the following scenarios on the xen -114 build:
(1) Set maxmem > memory (maxmem=1024, memory=512)
"xm save" fails; see the attached xm_dmesg.txt file.

(2) Set maxmem == memory (maxmem=1024, memory=1024)
"xm save" succeeds.

(3) Restore from maxmem == memory (maxmem=1024, memory=1024)
"xm restore" succeeds. After the restore, the guest can ping an outside machine, and writing some files in the guest works successfully.

- Tried the following scenarios on the xen -105 build:
(1) Set maxmem > memory (maxmem=1024, memory=512)
"xm save" succeeds.

(2) Restore from maxmem > memory (maxmem=1024, memory=512)
"xm restore" succeeds. After the restore, the guest can ping an outside machine, and writing some files in the guest works successfully.

(3) Set maxmem == memory (maxmem=1024, memory=1024)
"xm save" succeeds.

(4) Restore from maxmem == memory (maxmem=1024, memory=1024)
"xm restore" succeeds. After the restore, the guest can ping an outside machine, and writing some files in the guest works successfully.

Comment 7 XinSun 2010-08-11 08:12:23 UTC
Created attachment 438116 [details]
xm_dmesg.txt

Add xm dmesg info

Comment 8 Miroslav Rezanina 2010-08-11 08:52:14 UTC
This problem is caused by the handling of mfn page mapping errors introduced in xen-3.0.3-112.el5 for bz 504278. Because HVM save is broken, failing on a mapping error was introduced there. Unfortunately, this error also occurs when we try to save a PV guest with memory < maxmem, so the save fails even though it should succeed.
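
To illustrate the failure mode, here is a minimal, self-contained Python sketch. It is not the actual libxc/xend code; save_pages(), toy_map() and the toy pfn range are hypothetical stand-ins. It only shows why a policy that aborts on any mapping error (added for the HVM case) also breaks a PV save whose unpopulated pfns (memory < maxmem) cannot be mapped, whereas the older PV behaviour simply skipped them.

# Hypothetical sketch, not the real libxc C code: it just contrasts the two policies.
def save_pages(pfns, map_page, strict):
    saved = []
    for pfn in pfns:
        page = map_page(pfn)
        if page is None:
            if strict:
                # Policy added in xen-3.0.3-112.el5 for bz 504278:
                # any mapping failure aborts the whole save.
                raise RuntimeError("cannot map mfn page %x" % pfn)
            # Older PV behaviour: record the pfn as invalid and keep going.
            saved.append((pfn, None))
            continue
        saved.append((pfn, page))
    return saved

# Toy guest: only the first 8 pfns are backed, as if memory < maxmem.
populated = set(range(8))
toy_map = lambda pfn: (b"\x00" * 16) if pfn in populated else None

save_pages(range(12), toy_map, strict=False)    # succeeds; the holes are skipped
try:
    save_pages(range(12), toy_map, strict=True)     # fails, like the reported bug
except RuntimeError as err:
    print("save failed:", err)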

Comment 9 Andrew Jones 2010-08-11 12:42:30 UTC
This is an interesting problem.

Miroslav's and my experiments show that to reproduce the problem you must have maxmem > memory, but how much greater it is doesn't seem to change the following observations.

xend.log gets messages like this

INFO (XendCheckpoint:375) cannot map mfn page 20000 gpfn 20000: Invalid argument

and 'xm dmesg' has corresponding messages like this

(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:180:d0 Error pfn 7c0fa: rd=f000000007c58080, od=0000000000000000, caf=0000000000000000, taf=0000000000000000

You get more or fewer of those messages depending on the 'memory' parameter, one message per unmapped page. For example:

memory=256  ->  9 messages
memory=512  -> 17 messages
memory=1024 -> 33 messages
memory=2048 -> 65 messages

Furthermore the first mfn of the series of unmapped pages changes with 'memory'.

memory=256  ->  4000
memory=512  ->  8000
memory=1024 -> 10000
memory=2048 -> 20000


So there's a hole in the page map that gets created when 'memory' < 'maxmem', and its size and location depend on the value of 'memory'. Still digging to try to find out why.
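
The pattern above can be cross-checked with a quick sketch (my own arithmetic on the reported numbers, assuming the 16 KB ia64 page size mentioned further down in the thread): the first unmapped pfn is simply 'memory' divided by the page size, and the message count fits memory-in-MB divided by 32, plus one.

# Sketch checking the reported numbers; assumes 16 KB ia64 pages.
PAGE_SIZE = 16 * 1024

observations = [     # (memory in MB, first unmapped pfn, number of messages)
    (256,  0x4000,  9),
    (512,  0x8000,  17),
    (1024, 0x10000, 33),
    (2048, 0x20000, 65),
]

for memory_mb, first_pfn, messages in observations:
    mem_bytes = memory_mb * 1024 * 1024
    assert mem_bytes // PAGE_SIZE == first_pfn    # the hole starts right where 'memory' ends
    assert memory_mb // 32 + 1 == messages        # one message per 32 MB of 'memory', plus one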

Comment 10 XinSun 2010-08-13 07:40:53 UTC
Today I tried a rhel5.5 HVM guest on the ia64 platform with the xen -115 build; "xm save" still fails even if maxmem = memory.
(1) Set maxmem == memory (maxmem=1024, memory=1024)
"xm save" fails

(2) Set maxmem > memory (maxmem=1024, memory=512)
"xm save" fails

I have attached the "xm dmesg" info as xm_dmesg_hvm.txt.

Comment 11 XinSun 2010-08-13 07:45:03 UTC
Created attachment 438614 [details]
xm_dmesg_hvm.txt

Attach hvm guest dmesg info: xm_dmesg_hvm.txt

Comment 12 Andrew Jones 2010-08-13 08:08:17 UTC
(In reply to comment #10)
> Today I tried a rhel5.5 HVM guest on the ia64 platform with the xen -115
> build; "xm save" still fails even if maxmem = memory.

I'm not sure if hvm guests have ever worked for save+restore on ia64, or if it's even supposed to be supported. We could look for old bugs or try again with older versions to see if that's a regression, but if so, that's a different bug. This bug will focus on the PV save+restore.

Comment 13 Miroslav Rezanina 2010-08-13 08:29:07 UTC
(In reply to comment #12)
> (In reply to comment #10)
> > Today I tried a rhel5.5 HVM guest on the ia64 platform with the xen -115
> > build; "xm save" still fails even if maxmem = memory.
> 
> I'm not sure if hvm guests have ever worked for save+restore on ia64, or if
> it's even supposed to be supported. We could look for old bugs or try again
> with older versions to see if that's a regression, but if so, that's a
> different bug. This bug will focus on the PV save+restore.    

No, HVM save/restore on ia64 is not working. The fix that causes this regression was introduced to report an error on save; without it, saving an HVM guest ends with success even though the restore is not possible.

Comment 14 Andrew Jones 2010-08-13 13:43:54 UTC
A little more data to end the week on

--------------------------------------------------------------------------
maxmem=512, memory=256

(XEN) pte present from 3bf8 (3bf8) to 4009. 411 pages.
(XEN) invalid mfns from 4009 to 8000. 3ff7 pages.
(XEN) pte present total   = 00003fcb
(XEN) invalid mfns total  = 00004035
(XEN) total               = 00008000 (total pages = 00008000)

and have these xend logs

[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4000 gpfn 4000: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4001 gpfn 4001: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4002 gpfn 4002: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4003 gpfn 4003: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4004 gpfn 4004: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4005 gpfn 4005: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4006 gpfn 4006: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4007 gpfn 4007: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4008 gpfn 4008: Invalid argument

--------------------------------------------------------------------------
maxmem=256 == memory=256

(XEN) pte present from 3bf8 (3bf8) to 4000. 408 pages.
(XEN) pte present total   = 00003f9f
(XEN) invalid mfns total  = 00000061
(XEN) total               = 00004000 (total pages = 00004000)

No xend logs

--------------------------------------------------------------------------

So everything would work if the invalid mfns started at 4000, because there's
already code that handles that in libxc. So the question is why do they start
at 4009?

Note, the "memory=256 -> 4000" is clear to me now. Before I was thinking 4k
pages and it didn't make as much sense, but the ia64 is using 16k pages
(0x4000*16k = 256M).

Comment 15 Andrew Jones 2010-08-13 14:32:47 UTC
This is probably due to the page directory. The number of "problematic" pages relative to the value of the 'memory' variable adds up.

PAGE_SHIFT = 14
PTRS_PER_PGD = (1<<(PAGE_SHIFT-3)) = 2k
2k * 16k = 32M
256M/32M = 8

Still need to figure out why it's "problematic".
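
The same arithmetic, spelled out as a small sketch (a restatement of the figures quoted above, not new analysis):

# ia64 constants as quoted in this comment.
PAGE_SHIFT   = 14
PAGE_SIZE    = 1 << PAGE_SHIFT               # 16 KB pages
PTRS_PER_PGD = 1 << (PAGE_SHIFT - 3)         # 2048 entries per page-directory page
PGD_SPAN     = PTRS_PER_PGD * PAGE_SIZE      # 32 MB of guest memory per PGD page

memory = 256 * 1024 * 1024                   # the memory=256 case from comment 14
print(PGD_SPAN // (1024 * 1024))             # -> 32
print(memory // PGD_SPAN)                    # -> 8, in line with the ~9 "problematic"
                                             #    pages (pfns 0x4000-0x4008) seen above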

Comment 17 Paolo Bonzini 2010-09-16 07:38:57 UTC
This is a regression caused by the erroneous fix for bug 504278.

Moving back to 5.6 and assigning to Mirek who will simply revert the patch.

Comment 21 Miroslav Rezanina 2010-10-01 07:09:37 UTC
Fix built into xen-3.0.3-117.el5.

Comment 24 Yufang Zhang 2010-10-28 06:34:56 UTC
QA verified this bug on xen-3.0.3-117.el5:

On an ia64 machine, created a PV guest with memory=1024 and maxmem=2048. The guest could be saved and restored successfully.

Changing this bug to VERIFIED.

Comment 26 errata-xmlrpc 2011-01-13 22:23:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html

