Description of problem:

IBM is reporting that sometimes a migrate/live migrate fails between two
machines. Their full bug report is below:

LTC Owner is: nivedita.com
LTC Originator is: dmsmith1.com

---Problem Description---
Migration between two identical machines fails

Contact Information = Dan Smith (danms.com)

---uname output---
Linux elm3b194.beaverton.ibm.com 2.6.18-1.2739.el5xen #1 SMP Thu Oct 26
16:30:56 EDT 2006 i686 i686 i386 GNU/Linux

Machine Type = HS20: 8843-21U

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Create a domain with the xm-test ramdisk and 64MB of memory, then migrate it
from one machine to the other. If it doesn't fail, repeat until it does.
Failure is when the domain disappears from the local machine and does not
show up on the remote one.

---Xen Component Data---
Userspace tool common name: xm / xc_restore

The userspace tool has the following bit modes: 32-bit

Userspace rpm: xen

xend.log on the receiving machine:

[2006-11-08 23:25:49 xend 2477] DEBUG (XendCheckpoint:155) [xc_restore]: /usr/l$
[2006-11-08 23:25:49 xend 2477] ERROR (XendCheckpoint:236) xc_linux_restore sta$
[2006-11-08 23:25:49 xend 2477] ERROR (XendCheckpoint:236) Increased domain res$
[2006-11-08 23:25:49 xend 2477] ERROR (XendCheckpoint:236) Reloading memory pag$
[2006-11-08 23:25:50 xend 2477] ERROR (XendCheckpoint:236) Received all pages ($
[2006-11-08 23:25:50 xend 2477] ERROR (XendCheckpoint:236) Failed to pin batch $
[2006-11-08 23:25:50 xend 2477] ERROR (XendCheckpoint:236) Restore exit with rc$
[2006-11-08 23:25:50 xend.XendDomainInfo 2477] DEBUG (XendDomainInfo:1449) Xend$
[2006-11-08 23:25:50 xend.XendDomainInfo 2477] DEBUG (XendDomainInfo:1457) Xend$
[2006-11-08 23:25:50 xend.XendDomainInfo 2477] ERROR (XendDomainInfo:1463) Xend$
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1461$
    xc.domain_destroy(self.domid)
Error: (3, 'No such process')
[2006-11-08 23:25:50 xend 2477] ERROR (XendDomain:268) Restore failed
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/xen/xend/XendDomain.py", line 263, in $
    return XendCheckpoint.restore(self, fd)
  File "/usr/lib/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 159,$
    forkHelper(cmd, fd, handler.handler, True)
  File "/usr/lib/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 227,$
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib/xen/bin/xc_restore 16 1 18432 1 2 failed

xm dmesg output:

(XEN) DOM0: (file=mm.c, line=572) Bad L1 flags 80000000
(XEN) DOM0: (file=mm.c, line=850) Failure in alloc_l1_table: entry 13
(XEN) DOM0: (file=mm.c, line=1707) Error while validating mfn 1af1b (pfn ba2) f$
(XEN) DOM0: (file=mm.c, line=998) Failure in alloc_l2_table: entry 3
(XEN) DOM0: (file=mm.c, line=1707) Error while validating mfn 13762 (pfn 75b) f$
(XEN) DOM0: (file=mm.c, line=1063) Failure in alloc_l3_table: entry 3
(XEN) DOM0: (file=mm.c, line=1707) Error while validating mfn 1375e (pfn 75f) f$
(XEN) DOM0: (file=mm.c, line=1985) Error while pinning mfn 1375e

Sending machine's cpuinfo (x4):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 1
cpu MHz         : 3200.226
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush $
bogomips        : 8004.69

Receiving machine's cpuinfo (x4):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 1
cpu MHz         : 3200.244
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush $
bogomips        : 8003.08

This problem only occurs on the RHEL5 kernel at this time. The same test on
xen-unstable bits gets further but fails at a different point due to other
issues. This failure is hit on RHEL5 usually within 5 or so tries, often on
the first try.

Still present in 20061102.
Are those messages in xm dmesg from a single failure, or from multiple ones?
Do your logs always get messy like this, with messages truncated partway
through the line? If so, maybe we should file a separate bug for that; the
truncation is rendering the log useless.
From what I have been able to dig up so far, the mfn we are trying to pin
into an L1 table carries a forbidden flag. Moreover, it is an unused flag:

(XEN) DOM0: (file=mm.c, line=572) Bad L1 flags 80000000

(flags are currently only defined within the low four hex digits, so
0x80000000 is outside the defined range). What remains to be answered is how
such a flag got there. Does the failure also happen with normal Linux guests,
or only with the xm-test instance?
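For context, the "Bad L1 flags" message comes from the flag-validation check
in get_page_from_l1e() in xen/arch/x86/mm.c. Here is a minimal, self-contained
C sketch of that kind of mask check; the disallow mask value below is purely
illustrative and is not the real L1_DISALLOW_MASK from the RHEL5 tree:

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative stand-in: treat everything above the low flag bits
     * as disallowed. The real L1_DISALLOW_MASK differs in detail. */
    #define TOY_L1_DISALLOW_MASK 0xFFFF0000u

    /* Toy model of the check in get_page_from_l1e(): any L1 entry
     * whose flags intersect the disallow mask is rejected, which in
     * turn makes the table validation and the pin fail. */
    static int check_l1_flags(uint32_t flags)
    {
        if (flags & TOY_L1_DISALLOW_MASK) {
            printf("Bad L1 flags %x\n",
                   (unsigned)(flags & TOY_L1_DISALLOW_MASK));
            return 0;   /* reject: pin is refused */
        }
        return 1;       /* accept */
    }

    int main(void)
    {
        /* Reproduces the message from the report verbatim. */
        check_l1_flags(0x80000000u);
        return 0;
    }

Running this prints exactly "Bad L1 flags 80000000": the high bit lies above
every defined flag, so any entry carrying it fails validation.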
What I have so far: alloc_l3_table() calls

    get_page_and_type_from_pagenr(l3e_get_pfn(pl3e[i]),
                                  PGT_l2_page_table | PGT_pae_xen_l2,  <===
                                  d)

(note the PGT_pae_xen_l2 flag), which in turn calls get_page_type() with the
flags unmodified. That function contains the following piece of code:

    ASSERT(!(x & PGT_pae_xen_l2));

The hypervisor is probably built with ASSERTs turned off, so this does not
trigger a bug check, but it gives strong reason to believe something may be
wrong here. This call in alloc_l3_table() is made only for the third page
index, which matches your report:

(XEN) DOM0: (file=mm.c, line=1063) Failure in alloc_l3_table: entry 3

All of this sits inside conditionals, so I am not 100% sure, but these are
rather good clues that I will be following over the next few days.
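To make the suspected interaction concrete, here is a minimal, self-contained
C toy model of the path described above. The PGT_* bit positions and the
function body are illustrative stand-ins, not the real definitions from
xen/include/asm-x86/mm.h, and the real get_page_type() operates on a
struct page_info rather than a bare type word:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-ins for the Xen type flags; the real bit
     * positions differ. */
    #define PGT_l2_page_table  (3u << 29)
    #define PGT_pae_xen_l2     (1u << 26)

    /* Toy model of get_page_type(): in a debug build the assertion
     * fires as soon as a caller passes PGT_pae_xen_l2 through
     * unmodified; with NDEBUG it is compiled out and the flag slips
     * through silently. */
    static int toy_get_page_type(uint32_t type)
    {
        assert(!(type & PGT_pae_xen_l2));
        printf("validating page with type %08x\n", (unsigned)type);
        return 1;
    }

    int main(void)
    {
        int i = 3;  /* alloc_l3_table sets PGT_pae_xen_l2 only for entry 3 */
        uint32_t type = PGT_l2_page_table |
                        (i == 3 ? PGT_pae_xen_l2 : 0u);
        return toy_get_page_type(type) ? 0 : 1;
    }

Built normally, the assert() aborts as soon as entry 3's type word is passed
through; built with -DNDEBUG (ASSERTs off, as in a production hypervisor) it
proceeds silently, which is the behavior suspected above.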
Aren't you able to see the whole lines in the xm dmesg log?
Created attachment 143303 [details] xend.log-failed_migration
Created attachment 143304 [details] xm_dmesg-failed_migration
Please test the RPMs that can be found at http://et.redhat.com/~gcosta/ and
see whether the problem still persists. Thanks.
Please update the status of this issue with the results of either the package
in comment #12 or the retest in comment #14.
Can't really ack until we have more details.
Closing out based on comment 20.