Created attachment 437556 [details]
rhel5u5-ia64-pv

Description of problem:
On the ia64 platform, saving a RHEL 5.5 PV guest fails when its maxmem > memory. If we set maxmem == memory, "xm save" succeeds.

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5
kernel-xen-2.6.18-211.el5

How reproducible:
Always

Steps to Reproduce:
1. Edit a RHEL 5.5 PV guest config to set maxmem > memory (see the config excerpt below)
2. Create the RHEL 5.5 PV guest:
   [host]# xm create rhel5u5-ia64-pv
3. Run "xm save" for this RHEL 5.5 PV guest:
   [host]# xm save rhel5u5-ia64-pv save_rhel

Actual results:
"xm save" fails:
[host]# xm save rhel5u5-ia64-pv save_rhel
Error: /usr/lib/xen/bin/xc_save 24 62 0 0 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.

Expected results:
"xm save" succeeds and the save file is generated.

Additional info:
Please see the attachments: rhel5u5-ia64-pv, xend.log
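For reference, a minimal, hypothetical excerpt of the relevant lines from the guest config used in step 1 (the full rhel5u5-ia64-pv config is in the attachment; the path and all other settings here are illustrative only):

# /etc/xen/rhel5u5-ia64-pv -- illustrative excerpt, other settings omitted
name   = "rhel5u5-ia64-pv"
memory = 512      # initial allocation in MB
maxmem = 1024     # ballooning ceiling in MB; maxmem > memory triggers the failure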
Created attachment 437557 [details]
xend.log

xend.log attached.
Hi XinSun, can you please try this scenario with the xen-3.0.3-105.el5 version? We made some changes to maxmem handling and I'd like to find out whether this is related to them.
(In reply to comment #2)
> Hi XinSun, can you please try this scenario with the xen-3.0.3-105.el5
> version? We made some changes to maxmem handling and I'd like to find out
> whether this is related to them.

I tried this scenario with xen-3.0.3-105.el5 on the ia64 platform again. With the -105 build, "xm save" succeeds when maxmem > memory (maxmem=1024, memory=512).
With -105, does it both save and restore correctly?

And in the case with -115 but maxmem == memory, does it both save and restore correctly? And when doing the test where maxmem == memory, are you using the 1024 or the 512? (Just to make sure that the max amount of memory is working, we should be using the 1024.)

There were a handful of changes that touched maxmem between -105 and -115. Perhaps the most suspicious for this problem is the last one, though, since it's on the restore path. Could you please try -114?

Also, when you reproduce the problem, could you please capture 'xm dmesg'?

Thanks,
Drew
- Tried the following scenarios on the xen -114 build:
(1) Set maxmem > memory (maxmem=1024, memory=512):
    "xm save" fails; see the xm_dmesg.txt file.
(2) Set maxmem == memory (maxmem=1024, memory=1024):
    "xm save" succeeds.
(3) Restore from maxmem == memory (maxmem=1024, memory=1024):
    "xm restore" succeeds. After restore, the guest pings an outside machine successfully and writing some files in the guest works.

- Tried the following scenarios on the xen -105 build:
(1) Set maxmem > memory (maxmem=1024, memory=512):
    "xm save" succeeds.
(2) Restore from maxmem > memory (maxmem=1024, memory=512):
    "xm restore" succeeds. After restore, the guest pings an outside machine successfully and writing some files in the guest works.
(3) Set maxmem == memory (maxmem=1024, memory=1024):
    "xm save" succeeds.
(4) Restore from maxmem == memory (maxmem=1024, memory=1024):
    "xm restore" succeeds. After restore, the guest pings an outside machine successfully and writing some files in the guest works.
Created attachment 438116 [details]
xm_dmesg.txt

Added xm dmesg info.
This problem is caused by the mfn page mapping error handling introduced in xen-3.0.3-112.el5 for bz 504278. Because HVM save is broken, a hard failure on mapping errors was added there. Unfortunately, the same mapping error occurs when we try to save a PV guest with memory < maxmem, so the save fails even though it should succeed.
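To make the behaviour change concrete, here is a minimal, self-contained sketch (an editor's illustration, not the actual libxc/xc_ia64 save code; simulate_map(), the helper names, and the page counts are stand-ins):

/* Editor's sketch of the behaviour change, NOT real libxc code. */
#include <stdio.h>
#include <stdbool.h>

#define MEM_PAGES    0x4000UL   /* memory = 256M with 16k pages (hypothetical) */
#define HOLE_PAGES   9UL        /* a few pages just above 'memory' fail to map */
#define MAXMEM_PAGES 0x8000UL   /* maxmem = 512M with 16k pages (hypothetical) */

static bool simulate_map(unsigned long gpfn)
{
    /* Simulates the foreign-page mapping: with memory < maxmem, a small
     * range of gpfns just above the 'memory' boundary cannot be mapped. */
    return !(gpfn >= MEM_PAGES && gpfn < MEM_PAGES + HOLE_PAGES);
}

static int save_pages(bool strict)
{
    for (unsigned long gpfn = 0; gpfn < MAXMEM_PAGES; gpfn++) {
        if (!simulate_map(gpfn)) {
            if (strict) {
                /* -112 behaviour, added for bz 504278 so broken HVM saves
                 * report an error instead of "succeeding": abort the save. */
                fprintf(stderr, "cannot map mfn page %lx gpfn %lx\n", gpfn, gpfn);
                return -1;
            }
            /* pre -112 behaviour: record the pfn as invalid and continue */
            continue;
        }
        /* ... map succeeded: write the page to the checkpoint file ... */
    }
    return 0;
}

int main(void)
{
    printf("lenient save (pre -112): %d\n", save_pages(false)); /* 0  -> succeeds */
    printf("strict save  (-112):     %d\n", save_pages(true));  /* -1 -> fails    */
    return 0;
}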
This is an interesting problem. Miroslav's and my experiments show that for the problem to reproduce you must have maxmem > memory, but the amount by which it is greater doesn't seem to change the following observations.

xend.log gets messages like this:

INFO (XendCheckpoint:375) cannot map mfn page 20000 gpfn 20000: Invalid argument

and 'xm dmesg' has corresponding messages like this:

(XEN) /builddir/build/BUILD/kernel-2.6.18/xen/include/asm/mm.h:180:d0 Error pfn 7c0fa: rd=f000000007c58080, od=0000000000000000, caf=0000000000000000, taf=0000000000000000

You get more or less of those messages depending on the 'memory' parameter. For example (one message per unmapped page):

memory=256  -> 9 messages
memory=512  -> 17 messages
memory=1024 -> 33 messages
memory=2048 -> 65 messages

Furthermore, the first mfn of the series of unmapped pages changes with 'memory':

memory=256  -> 4000
memory=512  -> 8000
memory=1024 -> 10000
memory=2048 -> 20000

So there's a hole in the page map that gets created when 'memory' < 'maxmem', and its size and location depend on the value of 'memory'. Still digging to try to find out why.
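As a quick sanity check of the numbers above (an editor's arithmetic sketch based only on the observed data, not xen code), the first unmapped mfn is exactly the 'memory' size expressed in 16k pages, and the message count minus one doubles along with 'memory':

/* Editor's arithmetic check; assumes 16k pages (PAGE_SHIFT = 14 on ia64). */
#include <stdio.h>

int main(void)
{
    const unsigned long page_size = 16UL * 1024;               /* 16k pages */
    const unsigned long mem_mb[]  = { 256, 512, 1024, 2048 };
    const unsigned long observed_msgs[] = { 9, 17, 33, 65 };

    for (int i = 0; i < 4; i++) {
        unsigned long first_pfn = mem_mb[i] * 1024UL * 1024 / page_size;
        printf("memory=%-4lu -> first unmapped mfn %lx, %lu messages (%lu + 1)\n",
               mem_mb[i], first_pfn, observed_msgs[i], observed_msgs[i] - 1);
    }
    /* Prints first mfns 4000, 8000, 10000, 20000 -- matching the table above;
     * the "messages - 1" column (8, 16, 32, 64) doubles with 'memory'. */
    return 0;
}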
Today I tried a RHEL 5.5 HVM guest on the ia64 platform with the xen -115 build; "xm save" still fails even if maxmem == memory.
(1) Set maxmem == memory (maxmem=1024, memory=1024):
    "xm save" fails.
(2) Set maxmem > memory (maxmem=1024, memory=512):
    "xm save" fails.
I attach the "xm dmesg" info as xm_dmesg_hvm.txt.
Created attachment 438614 [details]
xm_dmesg_hvm.txt

Attached the HVM guest dmesg info: xm_dmesg_hvm.txt
(In reply to comment #10)
> Today I tried a RHEL 5.5 HVM guest on the ia64 platform with the xen -115
> build; "xm save" still fails even if maxmem == memory.

I'm not sure if HVM guests have ever worked for save+restore on ia64, or if it's even supposed to be supported. We could look for old bugs or try again with older versions to see if that's a regression, but if so, that's a different bug. This bug will focus on the PV save+restore.
(In reply to comment #12)
> (In reply to comment #10)
> > Today I tried a RHEL 5.5 HVM guest on the ia64 platform with the xen -115
> > build; "xm save" still fails even if maxmem == memory.
>
> I'm not sure if HVM guests have ever worked for save+restore on ia64, or if
> it's even supposed to be supported. We could look for old bugs or try again
> with older versions to see if that's a regression, but if so, that's a
> different bug. This bug will focus on the PV save+restore.

No, HVM save/restore on ia64 is not working. The fix that caused this regression was introduced to report an error on save; without it, saving an HVM guest ends with success even though the restore is not possible.
A little more data to end the week on.

--------------------------------------------------------------------------
maxmem=512, memory=256

(XEN) pte present from 3bf8 (3bf8) to 4009. 411 pages.
(XEN) invalid mfns from 4009 to 8000. 3ff7 pages.
(XEN) pte present total = 00003fcb
(XEN) invalid mfns total = 00004035
(XEN) total = 00008000 (total pages = 00008000)

and these xend logs:

[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4000 gpfn 4000: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4001 gpfn 4001: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4002 gpfn 4002: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4003 gpfn 4003: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4004 gpfn 4004: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4005 gpfn 4005: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4006 gpfn 4006: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4007 gpfn 4007: Invalid argument
[2010-08-13 09:21:13 xend 2928] INFO (XendCheckpoint:375) cannot map mfn page 4008 gpfn 4008: Invalid argument

--------------------------------------------------------------------------
maxmem=256 == memory=256

(XEN) pte present from 3bf8 (3bf8) to 4000. 408 pages.
(XEN) pte present total = 00003f9f
(XEN) invalid mfns total = 00000061
(XEN) total = 00004000 (total pages = 00004000)

No xend logs.
--------------------------------------------------------------------------

So everything would work if the invalid mfns started at 4000, because there's already code in libxc that handles that. So the question is: why do they start at 4009?

Note, the "memory=256 -> 4000" is clear to me now. Before, I was thinking 4k pages and it didn't make as much sense, but ia64 is using 16k pages (0x4000 * 16k = 256M).
This is probably due to the page directory. The number of "problematic" pages relative to the amount of the 'memory' var adds up:

PAGE_SHIFT = 14
PTRS_PER_PGD = (1 << (PAGE_SHIFT - 3)) = 2k
2k * 16k = 32M
256M / 32M = 8

Still need to figure out why it's "problematic".
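Spelled out (an editor's sketch of the arithmetic above, using only the ia64 constants quoted in this comment, not the actual xen page-table code):

#include <stdio.h>

int main(void)
{
    const unsigned long PAGE_SHIFT   = 14;                        /* 16k pages */
    const unsigned long PAGE_SIZE    = 1UL << PAGE_SHIFT;
    const unsigned long PTRS_PER_PGD = 1UL << (PAGE_SHIFT - 3);   /* 2048 (2k) */
    const unsigned long span_per_pgd = PTRS_PER_PGD * PAGE_SIZE;  /* 32M       */
    const unsigned long memory       = 256UL << 20;               /* memory=256 */

    printf("one PGD page covers %lu MB\n", span_per_pgd >> 20);   /* 32 */
    printf("PGD pages for 256M:  %lu\n", memory / span_per_pgd);  /* 8  */
    /* The save sees 9 unmappable pages at 0x4000-0x4008 for memory=256,
     * i.e. one more than this count; why those pages fail to map is still
     * the open question in this comment. */
    return 0;
}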
This is a regression caused by the erroneous fix for bug 504278. Moving back to 5.6 and assigning to Mirek who will simply revert the patch.
Fix built into xen-3.0.3-117.el5
QA verified this bug on xen-3.0.3-117.el5:

On an ia64 machine, created a PV guest with memory=1024, maxmem=2048. The guest could be saved and restored successfully.

Changing this bug to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html