Bug 663418

Summary: [RHEL5] [XEN] - Live migration of Xen DomUs succeeds but produces error messages
Product: Red Hat Enterprise Linux 5 Reporter: asilva <asilva>
Component: xen Assignee: Michal Novotny <minovotn>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.6 CC: areis, drjones, jmunilla, jzheng, leiwang, minovotn, mrezanin, pbonzini, qwan, xen-maint, yuzhang, yuzhou
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-02 10:00:49 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 699616    

Description asilva 2010-12-15 18:13:15 UTC
> Description of problem:
Live migration succeeds, but xm dmesg on Dom0 shows the following errors:
---
(XEN) mm.c:654:d0 Error getting mfn f2d5b9 (pfn 6a2029) from L1 entry 8000000f2d5b9425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 11c164c (pfn 5d4048) from L1 entry 80000011c164c425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 11d0749 (pfn 6a584a) from L1 entry 80000011d0749425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 105b3c1 (pfn 2c145f) from L1 entry 800000105b3c1425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 11e34ba (pfn 8f1074) from L1 entry 80000011e34ba425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 1058716 (pfn 3ab7f2) from L1 entry 8000001058716425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 114fb76 (pfn 63127c) from L1 entry 800000114fb76425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 11c2d0d (pfn 21c4c9) from L1 entry 80000011c2d0d425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 11d3a70 (pfn 63cf20) from L1 entry 80000011d3a70425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 1016eca (pfn 29e7e4) from L1 entry 8000001016eca425 for dom7 
(XEN) printk: 11 messages suppressed. 
(XEN) mm.c:654:d0 Error getting mfn 7bfb03 (pfn 61b02a) from L1 entry 80000007bfb03425 for dom7 
(XEN) mm.c:654:d0 Error getting mfn 11f4cd9 (pfn 5c7e53) from L1 entry 80000011f4cd9425 for dom7 
--- 

> Version-Release number of selected component (if applicable):
Kernel - 2.6.18-194.17.4.el5xen

> How reproducible:
The problem occurs on an HP DL 360 G6 with 12GB of RAM (customer machine).

We tried to reproduce the issue in house without success. I used the following configuration.

Hardware:
Memory: one 8GB server and one 16GB server

Kernel-xen versions:
2.6.18-194.17.1.el5xen
2.6.18-194.17.4.el5xen
2.6.18-194.26.1.el5xen

In all versions the live migration worked perfectly without error messages.


> Actual results:
Live migration succeeds, but the error messages above are logged in the customer environment. We need to understand why these messages occur.

> Expected results:
Live migration works perfectly without error messages.

> Additional info:

This is how the error appears:

(XEN) ioapic_guest_write: apic=0, pin=4, old_irq=4, new_irq=4
(XEN) ioapic_guest_write: old_entry=000000f1, new_entry=000100f1
(XEN) ioapic_guest_write: Attempt to modify IO-APIC pin for in-use IRQ!
(XEN) mm.c:654:d0 Error getting mfn 7f0 (pfn 5555555555555555) from L1 entry 00000000007f0425 for dom32753
(XEN) mm.c:654:d0 Error getting mfn 7f0 (pfn 5555555555555555) from L1 entry 00000000007f0425 for dom32753
(XEN) mm.c:654:d0 Error getting mfn 7f0 (pfn 5555555555555555) from L1 entry 00000000007f0425 for dom32753

    /* Foreign mappings into guests in shadow external mode don't
     * contribute to writeable mapping refcounts.  (This allows the
     * qemu-dm helper process in dom0 to map the domain's memory without
     * messing up the count of "real" writable mappings.) */
    okay = (((l1e_get_flags(l1e) & _PAGE_RW) &&
             !(unlikely(paging_mode_external(d) && (d != current->domain))))
            ? get_page_and_type(page, d, PGT_writable_page)
            : get_page(page, d));
    if ( !okay )
    {
        MEM_LOG("Error getting mfn %lx (pfn %lx) from L1 entry %" PRIpte
                " for dom%d",
                mfn, get_gpfn_from_mfn(mfn),
                l1e_get_intpte(l1e), d->domain_id);
    }

    return okay;

My first interpretation would be that the "page" exists, but "dom32753" is not able to access it because it wasn't mapped into the address space.
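To make the logged L1 entries easier to read, here is a minimal sketch that splits one into its frame number and flag bits. It assumes the standard x86-64 page-table entry layout with 4 KiB pages; the macro and function names are hypothetical and are not taken from the Xen sources.

```c
#include <stdint.h>

/* Architectural x86-64 PTE bits (assumption: 4 KiB pages, long mode). */
#define PTE_PRESENT   (1ULL << 0)
#define PTE_RW        (1ULL << 1)
/* Bits 12..51 hold the frame number; bit 63 (NX) and the low flags are masked off. */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL

/* Hypothetical helper: machine frame number encoded in an L1 entry. */
static inline uint64_t pte_to_mfn(uint64_t pte)
{
    return (pte & PTE_ADDR_MASK) >> 12;
}

/* Hypothetical helper: nonzero if the entry has the present bit set. */
static inline int pte_is_present(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}

/* Hypothetical helper: nonzero if the entry requests a writable mapping. */
static inline int pte_is_writable(uint64_t pte)
{
    return (pte & PTE_RW) != 0;
}
```

Applied to the entries in the log, 8000000f2d5b9425 decodes to mfn f2d5b9 and 00000000007f0425 to mfn 7f0, matching the mfn printed in each message. The low flag bits are 0x425 in every logged entry, so the present bit is set and the RW bit is clear. If execution reaches the quoted ternary with such an entry, the _PAGE_RW test fails and get_page() would be called rather than get_page_and_type().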

>
(XEN) mm.c:654:d0 Error getting mfn 7f0 (pfn 5555555555555555) from L1 entry 00000000007f0425 for dom32753
(XEN) mm.c:654:d0 Error getting mfn 7f0 (pfn 5555555555555555) from L1 entry 00000000007f0425 for dom32753
(XEN) mm.c:654:d0 Error getting mfn 7f0 (pfn 5555555555555555) from L1 entry 00000000007f0425 for dom32753
>


Here "dom32753" is a bogus value. It does not match the ID of any domain (prev/non-prev dom); in other words, it does not map to anything.

There is an idle domain that gets started, but the ID here is "32753" (0x7ff1), which is bogus.


#define IDLE_DOMAIN_ID   (0x7FFFU)

If it can't get a writable page (or can't get the page at all), then this can happen. That page might not be mapped. Something is trampling over memory and producing a bogus domain ID.

Comment 1 Andrew Jones 2010-12-15 21:31:37 UTC
Please give guest details. Was it 32-bit? What kernel release? How much memory did it have allocated to it in its config?

Thanks,
Drew

Comment 2 Andrew Jones 2010-12-15 23:06:36 UTC
domid (0x7ff1) isn't bogus, it's dom_io

#define DOMID_IO   (0x7FF1U)

Now using it might be a bogus thing to do though...
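As a quick check of the arithmetic, the following sketch compares the decimal ID printed in the log against the two constants quoted in this report. The helper names are hypothetical; only the two #define values come from the report itself.

```c
/* Values as quoted in this report from the Xen sources. */
#define DOMID_IO        (0x7FF1U)
#define IDLE_DOMAIN_ID  (0x7FFFU)

/* Hypothetical helper: nonzero if the logged "dom%d" value is dom_io. */
static inline int is_dom_io(unsigned int domid)
{
    return domid == DOMID_IO;
}

/* Hypothetical helper: nonzero if the logged "dom%d" value is the idle domain. */
static inline int is_idle_domain(unsigned int domid)
{
    return domid == IDLE_DOMAIN_ID;
}
```

With these definitions, 32753 equals 0x7FF1, so is_dom_io(32753) is true, while the idle domain ID 0x7FFF is 32767 in decimal.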

What we know is that we went into get_page_from_l1e(), the page was present and had valid flags, but that the mfn wasn't valid, so we went into 

    if ( unlikely(!mfn_valid(mfn)) ||
         unlikely(page_get_owner(page) == dom_io) )
    {
...
       d = dom_io;
    }

Upstream doesn't set d to dom_io for this case any more, not since c/s 16402, which is a patch for foreign access to iomem pages. Maybe we shouldn't either?

So the error message makes sense. The pfn = 555... is the initial value for pfns ("an obvious debug pattern" per the sources). Therefore an invalid mfn could see that for the pfn and also a domid of 32753.
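The pfn value can be recognized mechanically. The sketch below assumes the debug pattern is produced by filling the entry with the byte 0x55, which is consistent with the 5555555555555555 seen in the log; the names are hypothetical and this is not the hypervisor's own code.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: untouched machine-to-physical entries are filled with this byte. */
#define M2P_DEBUG_BYTE 0x55

/* Hypothetical helper: value of a 64-bit entry that was only ever pattern-filled. */
static inline uint64_t fresh_m2p_entry(void)
{
    uint64_t entry;
    memset(&entry, M2P_DEBUG_BYTE, sizeof(entry));
    return entry;
}

/* Hypothetical helper: nonzero if a logged pfn is just the fill pattern. */
static inline int pfn_looks_uninitialized(uint64_t pfn)
{
    return pfn == fresh_m2p_entry();
}
```

Under that assumption, the pfn 5555555555555555 in the dom32753 messages is flagged as uninitialized, whereas a pfn such as 6a2029 from the messages at the top of the description is not.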

The remaining question is why are the mfns invalid? That question also applies to the more sane looking messages posted at the top of the description. I assume that that is because during the live migration we pulled the rug out from under a process accessing particular mfns when moving it to another machine (with different mfns assigned to it). That can be tested by live migrating a VM that isn't doing anything, or by migrating a busy machine, but not live, and then checking that the logs are clean. Perhaps when Alberto attempted to reproduce he attempted to live migrate a VM that wasn't doing anything?

Without knowing the type of guest, or what exactly it was doing at the time of migration, then it's hard to say if these error messages indicate a problem with migration or if they can be safely ignored. My understanding is that they can usually be safely ignored, because the guest kernel will generally BUG if it isn't prepared to have a failed page update.

Comment 3 asilva 2010-12-17 12:40:41 UTC
Hello Drew,

The error occurs even without executing a live migration.

We cannot figure out why this error occurs even when there is no live migration and there are no guest machines on this Xen host.

I've attached the host sosreport, xen dmesg and xend-config.sxp. In xen dmesg you can see the error messages. 

> Perhaps when Alberto attempted to reproduce he attempted to live migrate a VM that wasn't doing anything?
A: Yes, my VM was idle during the test.

Cheers,
Alberto Silva

Comment 7 asilva 2010-12-17 13:23:40 UTC
There is a serial console configuration on the host. Could it be related?

I ran the tests using the same console configuration but saw no errors. It may be specific to the hardware.

Comment 10 Michal Novotny 2011-03-25 08:41:08 UTC
Alberto, I tried to get the machine to reproduce it but had no luck. Do you have access to any machine to reproduce it now?

Thanks a lot!
Michal

Comment 14 Miroslav Rezanina 2011-05-02 10:00:49 UTC
As we do not have enough information on this topic and it only concerns warning messages, closing this BZ.