Created attachment 440585 [details]
Description of problem:
Saving a paused PV guest produces an error message preventing the save, but afterwards the PV guest's console becomes unavailable. The domain's name also changes to migrating-xxx in the qemu-dm process, and the error below appears in xend.log:
Traceback (most recent call last):
File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2300, in resumeDomain
Error: (1, 'Internal error', "Couldn't map start_info")
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a PV guest
2. xm pause the PV guest
3. xm save the paused PV guest
4. Check the PV guest with virt-viewer or vncviewer
5. xm console the PV guest
6. ps aux | grep qemu
The PV guest's console becomes unavailable after the save, even though an error message prevents saving a paused domain. At step 6, note that the domain's name has changed to migrating-xxx in the qemu-dm process.
The PV guest should keep its previous status, with the console available.
HVM guests do not have this issue.
xm-dmesg, xend.log, qemu-dm.*.log and an example xen PV guest config file are attached.
Created attachment 440586 [details]
Created attachment 440587 [details]
xen pv guest config file
Created attachment 440588 [details]
pv guest qemu-dm log
The PV guests become Zombies after xm destroy.
Name ID Mem(MiB) VCPUs State Time(s)
Domain-0 0 1761 2 r----- 221.2
Zombie-rhel-5.5-32-pv 1 512 4 -bp--d 15.2
Zombie-rhel-5.5-64-pv 3 512 4 -bp--d 10.9
Zombie-rhel6.0-64-pv 5 512 4 -bp--d 7.0
Well, I investigated this one: clearing those zombies requires restarting the xend service. The cause is that the event channel is not providing the value required for proper domain cleanup, which is kernel-xen bug 589123. Without that fix, PV guests cannot be destroyed/cleaned up successfully, so they leave zombies behind.
The "Couldn't map start_info" failure comes from the xc_map_foreign_range() function, which issues the IOCTL_PRIVCMD_MMAP call to the xen kernel. If that ioctl() returns a negative value, the function returns NULL, and that is what triggers the "Couldn't map start_info" message.
IOCTL_PRIVCMD_MMAP is implemented in drivers/xen/privcmd/privcmd.c in the kernel source (not in the hypervisor, as I had expected), so I guess the debug code needs to be added there.
I wrote a debug patch for libxc itself to determine the error being printed; the failure code is 14 ("Bad address" according to perror output). Per the errno man page this is EFAULT [EFAULT Bad address (POSIX.1)], which can occur in two places: when copying the mmapcmd (type privcmd_mmap_t) or the msg (type privcmd_mmap_entry_t) from user space. The msg copy happens both before and within a loop, and the error can also come from the direct_remap_pfn_range() function, whose result is assigned to the rc variable. So there are basically four places where debugging should be added to see what is going on.
Well, I created a patch for the drivers/xen/privcmd/privcmd.c file covering all of those places, using the kernel's standard printk() function. Unfortunately this code is merged into the xen kernel itself rather than the hypervisor, so it took me some time to recompile it and get my custom kernel (based on the -213.el5 RHEL-5 kernel) working because of various installation issues, but I finally got it running with the debug output added to privcmd. The message was:
MMAP: Error in direct_remap_pfn_range = -14
which means the error comes from the direct_remap_pfn_range() function, so that is where I continued my investigation. Grepping the RHEL-5 kernel git tree shows it defined in arch/i386/mm/ioremap-xen.c and in arch/ia64/xen/hypervisor.c; I guess the i386 code is shared with the x86_64 build, so that is the place for further investigation.
However, the interesting thing is that when I checked the `xm dmesg` output, I found one new line connected to the resume operation:
(XEN) mm.c:630:d0 Non-privileged (2) attempt to map I/O space 00000000
so I guess this is the next place to be investigated.
While testing I saw some interesting messages even when just pausing the guest. Pausing produced the following lines:
(XEN) /xen/include/asm/mm.h:225:d1 Error pfn 1abca: rd=ffff8300cee1a080, od=ffff8300ceef2080, caf=80000002, taf=00000000e8000001
(XEN) mm.c:654:d1 Error getting mfn 1abca (pfn 1dcfca) from L1 entry 000000001abca025 for dom1
(XEN) mm.c:3621:d1 ptwr_emulate: fixing up invalid PAE PTE 000000001abca025
so I guess the emulation failed. Since this is a PV guest using a modified kernel, I tried some other guests. The above was seen on the rhel5 i386 version, so I tried rhel6-64pv with debugging added to the hypervisor to see the results of the mod_l1_entry() calls. There were many -EINVAL (-22) occurrences, with the following messages:
(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=2 memflags=0 (262139 of 262144)
(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=3 memflags=0 (262139 of 262144)
When I paused the guest there were no new messages, unlike with rhel5 i386; however, when I tried to save the guest, the same message as above appeared:
(XEN) mm.c:630:d0 Non-privileged (4) attempt to map I/O space 00000000
I tried to work out what the values mean: domain 4 is the domid, and the I/O space value is basically the mfn. This I/O space message was seen on all the guests I tried, i.e. rhel5 i386, rhel4 x86_64 and rhel6 x86_64. The -EINVAL messages were seen only on rhel6, not on any other version.
Moreover, the non-privileged attempt message comes from the get_page_from_l1e() function in the hypervisor, and according to the code the iomem_access_permitted(d, mfn, mfn) check is what fails.
When I paused the domain I saw that the MFN was reset to 0x0:
(XEN) mm.c:626:d0 GPFL: Domain ID = 3, MFN = 00000000
(XEN) mm.c:632:d0 Non-privileged (3) attempt to map I/O space 00000000
but the other GPFL attempts (GPFL abbreviates get_page_from_l1e()) came only from domain 0, never from any other domain, which I think is the reason for the failure. When I ran save on a non-paused domain there was no call to get_page_from_l1e() at all.
Also, another strange thing, probably caused by my changes: I was unable to restore the PV guest from the save file because of the
ERROR Internal error: Unable to lock ctxt [(-1)]
message, coming directly from the xc_restore helper utility. It calls lock_pages(), which is basically an mlock() call, so I added debugging there; mlock() returned -1 with errno set to EPERM. There is no other occurrence of the "Unable to lock ctxt" string in the file.
Excerpt from mlock man page:
EPERM (Linux 2.6.9 and later) the caller was not privileged (CAP_IPC_LOCK) and its RLIMIT_MEMLOCK soft resource limit was 0.
Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft
resource limit instead defines a limit on how much memory an unprivileged process may lock.
So I don't know what is going on here; mlock() failed with a privilege error. This will require further investigation if it happens again, but I suspect it was caused by my debugging additions and my patches to libxc and the hypervisor/kernel.
I would rather check in xend's save path, and make sure the error is given as soon as possible, before anything is done on the domain.
Well, that's strange. Everything about this issue seems to be connected to the hypervisor and to happen inside the get_page_from_l1e() function. I tested pausing a PV guest, then unpausing and saving: everything worked as expected. But when I paused the PV guest and tried to save it, it ended with the "Couldn't map start_info" message and I saw the non-privileged attempt message in the xm dmesg output.
The odd thing here is that the domain structure is for a domain id higher than 0, so it appears that a bad domain structure is being passed to the get_page_from_l1e() function.
I've also found that the relevant get_page_from_l1e() call is located in the mod_l1_entry() function, on the following line:
if ( unlikely(!get_page_from_l1e(nl1e, FOREIGNDOM)) )
FOREIGNDOM->domain_id is set to the PV domain ID. I added some extra debugging into the get_page_from_l1e() function to see what is happening, why it is failing, and where exactly.
The debug output is showing following lines:
(XEN) mm.c:631:d0 Page: ffff828400000028
(XEN) mm.c:625:d0 p = dom_io: Domain ID = 1, MFN = 00000001
(XEN) mm.c:634:d0 GPFL: Domain ID = 1, MFN = 00000001
(XEN) mm.c:640:d0 Non-privileged (1) attempt to map I/O space 00000001
where the condition to print "p = dom_io" string there is defined as:
if (unlikely(page_get_owner(page) == dom_io))
MEM_LOG("p = dom_io: Domain ID = %u, MFN = %08lx", d->domain_id, mfn);
and dom_io appears to be a privileged domain for I/O paging, defined with DOMID_IO = 0x7FF1U (32753 decimal). It is allocated using alloc_domain() in the common/domain.c source file, called from the arch_init_memory() function defined in arch/x86/mm.c for both the i386 and x86_64 architectures (ia64 has different handling). The page_get_owner() function returns the domain structure owning the page variable, which belongs to the particular mfn value (as defined by the "page = mfn_to_page(mfn)" line), and mfn is defined by the "mfn = l1e_get_pfn(l1e)" line.
The page_get_owner() macro is defined as:
#define page_get_owner(_p) ( ((_p == 0) || (_p & 1)) ? NULL : ((void *)((unsigned long)(_p)+ ((((262 ## UL) >> 8) * 0xffff000000000000UL) | (262 ## UL << 39)) ))->u.inuse._domain) )
and since the page value was much higher, the page's owner resolves to the memory at ((unsigned long)(_p)+((((262 ## UL) >> 8) * 0xffff000000000000UL) | (262 ## UL << 39)) )->u.inuse._domain) ), which is equal to the dom_io location. I'm afraid this is the issue: either it should not be equal to this location, or this case is not handled properly.
Moreover, the cannot-restore issue seems to have been introduced by my patch that was merged into the latest RHEL-5 kernel and xen git trees.
This addresses the resume issue of the guest, but the better solution is to disallow the save operation at the very beginning of the code path, before any operation has been performed on the domain.
(In reply to comment #8)
> I would rather check in xend's save path, and make sure the error is given as
> soon as possible, before anything is done on the domain.
Yes, it is better to fix this in the tools directly. Upstream uses the suspend-cancellation mechanism, so the issue does not exist there, but for us that will not be easy to adopt, so fixing this in the user-space xen tools is the better way to go.
Created attachment 441230 [details]
Fix paused domain handling and introduce domain_shutdown() for XML-RPC
This patch introduces a domain_shutdown() method in the XendDomain class and fixes save/migrate/shutdown in the right places, including precise error messages.
RHEL5.5 x86_64 and i386 host
RHEL5.5 and RHEL6.0 PV guest
This issue can be reproduced with xen-3.0.3-115.el5.
And fixed with xen-3.0.3-116.el5.
After the save fails, the guest's console is still available and no start_info error is found in xend.log. The PV guest can also be saved and restored successfully after unpausing it.
So, moving the bug to VERIFIED.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.