Bug 626716

Summary: saving a paused PV guest causes the start_info page to be lost
Product: Red Hat Enterprise Linux 5
Reporter: Lei Wang <leiwang>
Component: xen
Assignee: Michal Novotny <minovotn>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.6
CC: areis, llim, minovotn, mshao, pbonzini, qwan, xen-maint, yuzhang
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-01-13 22:23:55 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (flags: none):
  xend.log
  xm-dmesg
  xen pv guest config file
  pv guest qemu-dm log
  Fix paused domain handling and introduce domain_shutdown() for XML-RPC

Description Lei Wang 2010-08-24 07:58:16 UTC
Created attachment 440585 [details]
xend.log

Description of problem:
Saving a paused PV guest produces an error message that prevents saving it this way, but afterwards the PV guest's console becomes unavailable. The domain's name can also be seen changed to migrating-xxx in the qemu-dm process, and the following error appears in xend.log:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2300, in resumeDomain
    xc.domain_resume(self.domid, fast)
Error: (1, 'Internal error', "Couldn't map start_info")

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5
kernel-xen-2.6.18-212.el5

How reproducible:
Always

Steps to Reproduce:
1. Create a PV guest.
2. xm pause the PV guest.
3. xm save the paused PV guest.
4. Check the PV guest with virt-viewer or vncviewer.
5. xm console the PV guest.
6. ps aux | grep qemu

Actual results:
The PV guest's console becomes unavailable after the save attempt, even though an error message prevents saving a paused domain. At step 6, note that the domain's name has changed to migrating-xxx in the qemu-dm process.

Expected results:
The PV guest should keep its previous state, with the console available.

Additional info:
HVM guests do not have this issue.
xm-dmesg, xend.log, qemu-dm.*.log and an example xen PV guest config file are attached.

Comment 1 Lei Wang 2010-08-24 07:59:29 UTC
Created attachment 440586 [details]
xm-dmesg

Comment 2 Lei Wang 2010-08-24 08:00:06 UTC
Created attachment 440587 [details]
xen pv guest config file

Comment 3 Lei Wang 2010-08-24 08:01:09 UTC
Created attachment 440588 [details]
pv guest qemu-dm log

Comment 4 Lei Wang 2010-08-24 08:09:52 UTC
The PV guests become zombies after destroying them with xm destroy.

# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     1761     2 r-----    221.2
Zombie-rhel-5.5-32-pv                      1      512     4 -bp--d     15.2
Zombie-rhel-5.5-64-pv                      3      512     4 -bp--d     10.9
Zombie-rhel6.0-64-pv                       5      512     4 -bp--d      7.0

Comment 5 Michal Novotny 2010-08-24 15:31:09 UTC
Well, I investigated this one, and clearing those zombies requires restarting the xend service. This is because the event channel does not provide the value required to do the proper domain cleanup, which is kernel-xen bug 589123. Without fixing that one, PV guests cannot be destroyed/cleaned up successfully, and therefore zombies are left behind.

The message about failing to map the start_info structure basically comes from the xc_map_foreign_range() function, which issues an IOCTL_PRIVCMD_MMAP call to the xen kernel. If that ioctl() returns a negative value, the function returns NULL, which means the "Couldn't map start_info" code path is being hit.
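Roughly, the libxc side of that error path has this shape (a from-memory sketch, not the exact RHEL-5 source; the 4 KiB page rounding and the header note are assumptions):

/* Sketch of xc_map_foreign_range() as it appears in libxc (xc_linux.c).
 * Needs <sys/mman.h>, <sys/ioctl.h>, <stdint.h> and the xen privcmd header
 * that defines privcmd_mmap_t, privcmd_mmap_entry_t and IOCTL_PRIVCMD_MMAP. */
void *xc_map_foreign_range(int xc_handle, uint32_t dom,
                           int size, int prot, unsigned long mfn)
{
    privcmd_mmap_t ioctlx;
    privcmd_mmap_entry_t entry;
    void *addr;

    /* Reserve a virtual range backed by the privcmd device fd. */
    addr = mmap(NULL, size, prot, MAP_SHARED, xc_handle, 0);
    if (addr == MAP_FAILED)
        return NULL;

    ioctlx.num   = 1;
    ioctlx.dom   = dom;
    ioctlx.entry = &entry;
    entry.va     = (unsigned long)addr;
    entry.mfn    = mfn;
    entry.npages = (size + 4096 - 1) / 4096;   /* 4 KiB pages assumed */

    /* Ask the xen kernel to map the foreign MFN into that range. */
    if (ioctl(xc_handle, IOCTL_PRIVCMD_MMAP, &ioctlx) < 0) {
        munmap(addr, size);
        return NULL;    /* callers then print "Couldn't map start_info" */
    }
    return addr;
}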

IOCTL_PRIVCMD_MMAP is implemented in drivers/xen/privcmd/privcmd.c in the kernel sources (not in the hypervisor as I expected), so I guess the debug code needs to be added there.

I wrote a debug patch for libxc itself to determine the error being returned, and the failure code appears to be 14 (Bad address according to perror output); according to the errno man page this is EFAULT [EFAULT Bad address (POSIX.1)]. In the ioctl handler, EFAULT can be returned on two occasions: when copying mmapcmd (type privcmd_mmap_t) or msg (type privcmd_mmap_entry_t) from user space. The msg copy happens both before and within a loop, and the error can also come from the direct_remap_pfn_range() function, whose result value is assigned to the rc variable. So there are basically four places where the debugging should be added to see what's going on there.
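The four spots line up roughly like this (a from-memory sketch of the handler's shape, not the exact RHEL-5 code; the wrapper name, its parameters and the elided vma lookup are just for illustration):

/* Sketch of the IOCTL_PRIVCMD_MMAP path in drivers/xen/privcmd/privcmd.c;
 * the printk() lines mark where the debug output should go.  Kernel context
 * (asm/uaccess.h, linux/mm.h, the xen privcmd definitions) is assumed. */
static int privcmd_mmap_sketch(void __user *udata, struct vm_area_struct *vma)
{
    privcmd_mmap_t mmapcmd;
    privcmd_mmap_entry_t msg;
    privcmd_mmap_entry_t __user *p;
    int i, rc;

    if (copy_from_user(&mmapcmd, udata, sizeof(mmapcmd))) {
        printk(KERN_DEBUG "MMAP: copy of mmapcmd failed\n");         /* spot 1 */
        return -EFAULT;
    }

    p = mmapcmd.entry;
    if (copy_from_user(&msg, p, sizeof(msg))) {
        printk(KERN_DEBUG "MMAP: first copy of msg failed\n");       /* spot 2 */
        return -EFAULT;
    }
    /* ... find and sanity-check the vma covering msg.va (elided) ... */

    for (i = 0; i < mmapcmd.num; i++, p++) {
        if (copy_from_user(&msg, p, sizeof(msg))) {
            printk(KERN_DEBUG "MMAP: copy of msg in loop failed\n"); /* spot 3 */
            return -EFAULT;
        }

        rc = direct_remap_pfn_range(vma, msg.va, msg.mfn,
                                    msg.npages << PAGE_SHIFT,
                                    vma->vm_page_prot, mmapcmd.dom);
        if (rc < 0) {
            printk(KERN_DEBUG "MMAP: Error in direct_remap_pfn_range = %d\n",
                   rc);                                              /* spot 4 */
            return rc;
        }
    }
    return 0;
}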

Michal

Comment 6 Michal Novotny 2010-08-25 14:18:55 UTC
Well, I've created a patch for the drivers/xen/privcmd/privcmd.c file that uses the kernel's standard printk() function at all the relevant places to see where it is failing. Unfortunately this code is part of the xen kernel itself rather than the hypervisor, so it took me some time to recompile it and get my custom kernel (based on the -213.el5 RHEL-5 kernel) working because of various installation issues, but finally I got it working with the debug output added to privcmd. The message there was:

MMAP: Error in direct_remap_pfn_range = -14

which means this is coming from the direct_remap_pfn_range() function, so that's where I continued my investigation. According to grepping in the RHEL-5 kernel git tree it appears to be defined in arch/i386/mm/ioremap-xen.c or arch/ia64/xen/hypervisor.c; I guess the code is shared between the i386 and x86_64 versions of the system, so this is the place for further investigation.

However, interestingly, when I checked the `xm dmesg` output I found one new line connected to the resume operation. The line said:

(XEN) mm.c:630:d0 Non-privileged (2) attempt to map I/O space 00000000

so I guess this is the next place to be investigated.

Michal

Comment 7 Michal Novotny 2010-08-25 16:34:11 UTC
While testing I saw some interesting messages even when pausing the guest. When I paused the guest the following lines appeared:

(XEN) /xen/include/asm/mm.h:225:d1 Error pfn 1abca: rd=ffff8300cee1a080, od=ffff8300ceef2080, caf=80000002, taf=00000000e8000001
(XEN) mm.c:654:d1 Error getting mfn 1abca (pfn 1dcfca) from L1 entry 000000001abca025 for dom1
(XEN) mm.c:3621:d1 ptwr_emulate: fixing up invalid PAE PTE 000000001abca025

so I guess the emulation failed. Since this is about a PV guest, which uses a modified kernel, I tried some other guests. The above was seen on a rhel-5 i386 guest, so I tried a rhel6-64 PV guest with debugging added to the hypervisor to see the results of mod_l1_entry() calls. There were many -EINVAL (-22) occurrences, with the following messages:

(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=2 memflags=0 (262139 of 262144)
(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=3 memflags=0 (262139 of 262144)

and when I paused the guest there were no new messages as in the rhel5 i386 case; however, when I tried to save the guest, the same message as above appeared:

(XEN) mm.c:630:d0 Non-privileged (4) attempt to map I/O space 00000000

I tried to work out what the values mean: domain 4 is the domid, and the I/O space value is basically the mfn. This I/O space message was seen on all the guests I tried, i.e. rhel5 i386, rhel4 x86_64 and rhel6 x86_64. The -EINVAL messages were seen only on rhel6 and no other version.

Moreover, the non-privileged attempt message comes from the get_page_from_l1e() function in the hypervisor, and according to the code it is the iomem_access_permitted(d, mfn, mfn) check that is failing.
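The relevant part of get_page_from_l1e() looks roughly like this (paraphrased from xen/arch/x86/mm.c from memory; exact helper names may differ slightly in the RHEL-5 tree):

    /* Paraphrased excerpt of get_page_from_l1e(); d is the domain
     * argument (FOREIGNDOM when called from mod_l1_entry()). */
    unsigned long mfn = l1e_get_pfn(l1e);
    struct page_info *page = mfn_to_page(mfn);

    if ( unlikely(!mfn_valid(mfn)) ||
         unlikely(page_get_owner(page) == dom_io) )
    {
        /* DOMID_IO reverts to the caller for privilege checks. */
        if ( d == dom_io )
            d = current->domain;

        if ( !iomem_access_permitted(d, mfn, mfn) )
        {
            MEM_LOG("Non-privileged (%u) attempt to map I/O space %08lx",
                    d->domain_id, mfn);
            return 0;    /* this is the failure we are hitting */
        }
        /* ... I/O pages are then mapped without reference counting ... */
    }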

When I paused the domain I saw that the MFN was reset to 0x0:

(XEN) mm.c:626:d0 GPFL: Domain ID = 3, MFN = 00000000
(XEN) mm.c:632:d0 Non-privileged (3) attempt to map I/O space 00000000

but the other GPFL lines (GPFL being an abbreviation of the get_page_from_l1e() function name) showed attempts only from domain 0, not from any other domain, which I think is the reason for the failure. When I ran save on a non-paused domain there was no call to get_page_from_l1e() at all.

Also, another strange thing (probably caused by my changes): I was unable to restore the PV guest from the save file because of the following:

ERROR Internal error: Unable to lock ctxt [(-1)]

This message comes directly from the xc_restore helper utility, which calls lock_pages(), basically an mlock() call, so I added debugging there and it returned -1 (-EPERM). There is no other occurrence of the "Unable to lock ctxt" string in the file.
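For context, lock_pages() in the restore code is just a thin wrapper (sketch from memory):

/* Sketch of the local lock_pages() helper used by the restore code:
 * essentially a plain mlock(), so the guest context buffer cannot be
 * paged out while the hypercall is in flight.  Needs <sys/mman.h>. */
static int lock_pages(void *addr, size_t len)
{
    return mlock(addr, len);    /* returns -1 and sets errno on failure */
}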

Excerpt from mlock man page:

EPERM  (Linux 2.6.9 and later) the caller was not privileged (CAP_IPC_LOCK) and its RLIMIT_MEMLOCK soft resource limit was 0.

And also:

       Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft resource limit instead defines a limit on how much memory an unprivileged process may lock.

So I don't know exactly what's going on, but the mlock() call failed for lack of privileges. This will need further investigation if it happens again, but I suspect it was caused by my own debugging changes, i.e. patching both libxc and the hypervisor/kernel a little to get more output during testing.

Michal

Comment 8 Paolo Bonzini 2010-08-26 13:00:24 UTC
I would rather check in xend's save path, and make sure the error is given as soon as possible, before anything is done on the domain.

Comment 9 Michal Novotny 2010-08-26 13:26:59 UTC
Well, that's strange. Everything about this issue seems to be connected to the hypervisor, inside the get_page_from_l1e() function. I tested pausing the PV guest, then unpausing and saving: everything worked as expected. But when I paused the PV guest and tried to save it, it ended up with the "Couldn't map start_info" message and I saw the non-privileged attempt message in the xm dmesg output.

The odd thing here is that the domain structure is for a domain id higher than 0, so it appears that the wrong domain structure is being passed to the get_page_from_l1e() function.

I've also found that the relevant get_page_from_l1e() call is located in the mod_l1_entry() function, on the following lines:

        if ( unlikely(!get_page_from_l1e(nl1e, FOREIGNDOM)) )
            return 0;

FOREIGNDOM->domain_id is set to the PV domain ID. I added some extra debugging to the get_page_from_l1e() function to see what's happening, why it is failing, and where exactly.

The debug output shows the following lines:

(XEN) mm.c:631:d0 Page: ffff828400000028
(XEN) mm.c:625:d0 p = dom_io: Domain ID = 1, MFN = 00000001
(XEN) mm.c:634:d0 GPFL: Domain ID = 1, MFN = 00000001
(XEN) mm.c:640:d0 Non-privileged (1) attempt to map I/O space 00000001

where the condition to print the "p = dom_io" string is defined as:

if (unlikely(page_get_owner(page) == dom_io))
  MEM_LOG("p = dom_io: Domain ID = %u, MFN = %08lx", d->domain_id, mfn);

and dom_io seems to be a privileged domain for I/O paging, defined with DOMID_IO = 0x7FF1U (32753 decimal) and allocated using alloc_domain() in the common/domain.c source file (called from the arch_init_memory() function defined in arch/x86/mm.c for both the i386 and x86_64 architectures; ia64 has different handling). The page_get_owner() function returns the domain structure owning the page variable, which belongs to the particular mfn value (as defined by the "page = mfn_to_page(mfn)" line), and mfn is defined by the "mfn = l1e_get_pfn(l1e)" line.

The page_get_owner() macro is defined as:
#define page_get_owner(_p)  ( ((_p == 0) || (_p & 1)) ? NULL : ((void *)((unsigned long)(_p)+ ((((262 ## UL) >> 8) * 0xffff000000000000UL) | (262 ## UL << 39)) ))->u.inuse._domain) )

and since the page value was much higher, page_get_owner() resolves it to the dom_io location. I'm afraid this is the issue, since either it shouldn't be equal to this location or that case is not handled properly.
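For readability, the unexpanded form of that chain is roughly the following (a sketch paraphrased from the Xen 3.x headers; the 262 slot and the 0xffff000000000000UL constant are exactly what shows up in the expansion above, the rest is reconstructed from memory):

/* Sketch of the macros behind the expansion above: on x86_64 the owning
 * domain pointer is "pickled" into 32 bits and restored by adding the
 * direct-map base, which lives in PML4 slot 262. */
#define PML4_ADDR(_slot)                             \
    ((((_slot ## UL) >> 8) * 0xffff000000000000UL) | \
     (_slot ## UL << 39))

static inline struct domain *unpickle_domptr(u32 _d)
{
    /* 0 or an odd value means "no owner". */
    return ((_d == 0) || (_d & 1)) ? NULL
        : (struct domain *)((unsigned long)_d + PML4_ADDR(262));
}

#define page_get_owner(_p)  (unpickle_domptr((_p)->u.inuse._domain))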

Moreover, the restore failure seems to be introduced by my patch that was merged into the latest RHEL-5 kernel and xen git trees.

All of the above concerns the guest resume issue, but the better solution is to disallow the save operation at the beginning of the code path, before any operation has been done on the domain.

Michal

Comment 10 Michal Novotny 2010-08-26 13:29:42 UTC
(In reply to comment #8)
> I would rather check in xend's save path, and make sure the error is given as
> soon as possible, before anything is done on the domain.

Yes, it is better to fix this in the tools directly. Also, upstream uses the suspend-cancellation mechanism, so the issue doesn't exist there, but for us it won't be that easy, so fixing this in the user-space xen tools is the better way to go.

Michal

Comment 11 Michal Novotny 2010-08-26 14:31:49 UTC
Created attachment 441230 [details]
Fix paused domain handling and introduce domain_shutdown() for XML-RPC

This patch introduces a domain_shutdown() method in the XendDomain class and fixes save/migrate/shutdown in the right places, including precise error messages.

Michal

Comment 15 Lei Wang 2010-09-10 06:18:06 UTC
Tested with:
RHEL5.5 x86_64 and i386 host
RHEL5.5 and RHEL6.0 PV guest

This issue can be reproduced with xen-3.0.3-115.el5.

And it is fixed with xen-3.0.3-116.el5.
After the save fails, the guest's console is still available and no start_info error is found in xend.log. The PV guest can also be saved and restored successfully after unpausing it.

So, moving the bug to VERIFIED.

Comment 17 errata-xmlrpc 2011-01-13 22:23:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html