Bug 215372

Summary: Live/Non-live migration sometimes fails between two machines
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.0
Hardware: All
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED CURRENTRELEASE
Reporter: Chris Lalancette <clalance>
Assignee: Glauber Costa <gcosta>
CC: gcosta, konradr, tao, xen-maint
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2007-01-08 14:15:18 UTC

Attachments:
xend.log-failed_migration
xm_dmesg-failed_migration

Description Chris Lalancette 2006-11-13 18:25:52 UTC
Description of problem:
IBM reports that migration (live or non-live) sometimes fails between two
machines.  Their full bug report is below:

LTC Owner is: nivedita.com
LTC Originator is: dmsmith1.com


---Problem Description---
Migration between two identical machines fails

Contact Information = Dan Smith (danms.com)

---uname output---
Linux elm3b194.beaverton.ibm.com 2.6.18-1.2739.el5xen #1 SMP Thu Oct 26 16:30:56
EDT 2006 i686 i686 i386 GNU/Linux

Machine Type = HS20: 8843-21U

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Create a domain with the xm-test ramdisk and 64MB of memory, then migrate it
from one machine to the other.  If it doesn't fail, repeat until it does.
Failure means the domain disappears from the local machine and never shows up
on the remote one.

---Xen Component Data---
Userspace tool common name: xm / xc_restore

The userspace tool has the following bit modes: 32-bit

Userspace rpm: xen

xend.log on the receiving machine:

[2006-11-08 23:25:49 xend 2477] DEBUG (XendCheckpoint:155) [xc_restore]:
/usr/l$[2006-11-08 23:25:49 xend 2477] ERROR (XendCheckpoint:236)
xc_linux_restore sta$[2006-11-08 23:25:49 xend 2477] ERROR (XendCheckpoint:236)
Increased domain res$[2006-11-08 23:25:49 xend 2477] ERROR (XendCheckpoint:236)
Reloading memory pag$[2006-11-08 23:25:50 xend 2477] ERROR (XendCheckpoint:236)
Received all pages ($[2006-11-08 23:25:50 xend 2477] ERROR (XendCheckpoint:236)
Failed to pin batch $[2006-11-08 23:25:50 xend 2477] ERROR (XendCheckpoint:236)
Restore exit with rc$[2006-11-08 23:25:50 xend.XendDomainInfo 2477] DEBUG
(XendDomainInfo:1449) Xend$[2006-11-08 23:25:50 xend.XendDomainInfo 2477] DEBUG
(XendDomainInfo:1457) Xend$[2006-11-08 23:25:50 xend.XendDomainInfo 2477] ERROR
(XendDomainInfo:1463) Xend$Traceback (most recent call last):
 File "/usr/lib/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1461$
  xc.domain_destroy(self.domid)
Error: (3, 'No such process')
[2006-11-08 23:25:50 xend 2477] ERROR (XendDomain:268) Restore failed
Traceback (most recent call last):
 File "/usr/lib/python2.4/site-packages/xen/xend/XendDomain.py", line 263, in $
  return XendCheckpoint.restore(self, fd)
 File "/usr/lib/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 159,$
  forkHelper(cmd, fd, handler.handler, True)
 File "/usr/lib/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 227,$
  raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib/xen/bin/xc_restore 16 1 18432 1 2 failed


xm dmesg output:

(XEN) DOM0: (file=mm.c, line=572) Bad L1 flags 80000000
(XEN) DOM0: (file=mm.c, line=850) Failure in alloc_l1_table: entry 13
(XEN) DOM0: (file=mm.c, line=1707) Error while validating mfn 1af1b (pfn ba2) f$
(XEN) DOM0: (file=mm.c, line=998) Failure in alloc_l2_table: entry 3
(XEN) DOM0: (file=mm.c, line=1707) Error while validating mfn 13762 (pfn 75b) f$
(XEN) DOM0: (file=mm.c, line=1063) Failure in alloc_l3_table: entry 3
(XEN) DOM0: (file=mm.c, line=1707) Error while validating mfn 1375e (pfn 75f) f$
(XEN) DOM0: (file=mm.c, line=1985) Error while pinning mfn 1375e
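
Read in order, the log shows the failure propagating outward: the bad flag
fails L1 validation (entry 13), which fails the L2 entry referencing that L1
(entry 3), which fails the L3 entry (entry 3), and the pin of mfn 1375e
finally aborts.  A minimal sketch of that propagation, with function names
echoing mm.c but bodies that are purely illustrative:

    #include <stdio.h>

    /* Purely illustrative: each level validates its entries and
     * propagates failure upward, matching the cascade in the log. */
    static int alloc_l1_table(void)
    {
        printf("Bad L1 flags 80000000\n");  /* entry 13 carries the bad flag */
        printf("Failure in alloc_l1_table: entry 13\n");
        return -1;
    }

    static int alloc_l2_table(void)
    {
        if (alloc_l1_table() < 0) {         /* L2 entry 3 points at the bad L1 */
            printf("Failure in alloc_l2_table: entry 3\n");
            return -1;
        }
        return 0;
    }

    static int alloc_l3_table(void)
    {
        if (alloc_l2_table() < 0) {         /* L3 entry 3 points at the bad L2 */
            printf("Failure in alloc_l3_table: entry 3\n");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        if (alloc_l3_table() < 0)           /* pin of the page-table root aborts */
            printf("Error while pinning mfn 1375e\n");
        return 0;
    }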


Sending machine's cpuinfo (x4):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 1
cpu MHz         : 3200.226
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush$
bogomips        : 8004.69


Receiving machine's cpuinfo (x4):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 1
cpu MHz         : 3200.244
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush$
bogomips        : 8003.08


This problem only occurs on the
RHEL5 kernel at this time. The same test on xen-unstable bits gets further but
fails at a different point due to other issues.

On RHEL5 this failure is usually hit within five or so tries, often on the first try.

Still present as of the 20061102 build.

Comment 1 Glauber Costa 2006-12-06 01:40:16 UTC
Are the messages in xm dmesg from a single failure, or from multiple ones?
Are your logs always getting garbled, with messages truncated this way? If so,
maybe we should file a separate bug for that, since it is rendering the log
useless.

Comment 2 Glauber Costa 2006-12-06 13:59:06 UTC
From what I've been able to dig up so far, the mfn we're trying to pin as an L1
table has a forbidden flag set. Moreover, this flag is one that isn't used at all:

(XEN) DOM0: (file=mm.c, line=572) Bad L1 flags 80000000 

(Valid flags are currently defined only in the low four hex digits.) What
remains to be answered is how such a flag got there.
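
For illustration only, here is a minimal sketch of the kind of check that
produces this message. Xen's real test lives in mm.c and uses a per-level
disallow mask; the mask value and function name below are assumptions, not
Xen's code:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical mask of the flag bits a guest may set in an L1 entry.
     * Xen's real test uses a per-level disallow mask in mm.c; the mask
     * value and function name here are assumptions for illustration. */
    #define L1_ALLOWED_MASK UINT32_C(0x00000fff)

    static int l1e_flags_ok(uint32_t flags)
    {
        uint32_t bad = flags & ~L1_ALLOWED_MASK;
        if (bad != 0) {
            /* Mirrors the shape of the "Bad L1 flags 80000000" message. */
            printf("Bad L1 flags %" PRIx32 "\n", bad);
            return 0;
        }
        return 1;
    }

    int main(void)
    {
        /* Bit 31 set: outside every defined flag, as in this report. */
        l1e_flags_ok(UINT32_C(0x80000000) | UINT32_C(0x027));
        return 0;
    }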

Does the failure also happen with normal Linux guests, or only with the
xm-test instance? 

Comment 3 Glauber Costa 2006-12-07 18:16:05 UTC
What I have so far:

alloc_l3_table() calls

    get_page_and_type_from_pagenr(l3e_get_pfn(pl3e[i]),
                                  PGT_l2_page_table |
                                  PGT_pae_xen_l2,      <=== note these flags
                                  d)

which in turn calls get_page_type() with the flags unmodified. The latter
contains the following piece of code:

    ASSERT(!(x & PGT_pae_xen_l2));

The build probably has ASSERTs compiled out, so this never trips. But it gives
strong reason to believe something may be wrong here. This call in
alloc_l3_table() is made only for the third page index, and according to your
report:

   (XEN) DOM0: (file=mm.c, line=1063) Failure in alloc_l3_table: entry 3

All of this is inside conditionals, so I'm not 100% sure, but these are good
clues that I'll be following over the next few days.
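
To make the hypothesis concrete, a minimal, self-contained sketch (the PGT_*
values and the function body are stand-ins, not Xen's real ones): built with
-DNDEBUG, the assert is compiled out and the forbidden flag passes through
silently.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-ins for Xen's page type bits; real values differ. */
    #define PGT_l2_page_table UINT32_C(0x20000000)
    #define PGT_pae_xen_l2    UINT32_C(0x04000000)

    /* Sketch of the invariant: callers are never supposed to pass
     * PGT_pae_xen_l2 down to get_page_type().  Compiled with -DNDEBUG
     * (ASSERTs off), the forbidden flag sails through unnoticed. */
    static void get_page_type_sketch(uint32_t type)
    {
        assert(!(type & PGT_pae_xen_l2));   /* a no-op in a release build */
        printf("validating page, type %08x\n", (unsigned)type);
    }

    int main(void)
    {
        /* The call alloc_l3_table() makes for entry 3, flags unmodified. */
        get_page_type_sketch(PGT_l2_page_table | PGT_pae_xen_l2);
        return 0;
    }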


Comment 4 Glauber Costa 2006-12-08 14:57:03 UTC
Aren't you able to see the whole lines in the xm dmesg log?

Comment 9 Archana K. Raghavan 2006-12-11 17:26:27 UTC
Created attachment 143303 [details]
xend.log-failed_migration

Comment 10 Archana K. Raghavan 2006-12-11 17:27:12 UTC
Created attachment 143304 [details]
xm_dmesg-failed_migration

Comment 12 Glauber Costa 2006-12-12 16:34:28 UTC
Please test the RPMs that can be found at http://et.redhat.com/~gcosta/ and see
whether the problem still persists.

Thanks.

Comment 17 Brian Stein 2006-12-13 23:15:38 UTC
Please update the status of this issue with either the package in #12 or the
retest in #14.

Comment 19 Jay Turner 2006-12-14 02:59:05 UTC
Can't really ack until we have more details.

Comment 21 Jay Turner 2007-01-08 14:14:59 UTC
Closing out based on comment 20.