Bug 253583

Summary: Live migration of HVM/Fully-Virt guests crashes target host/dom0

Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.1
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Jan Mark Holzer <jmh>
Assignee: Chris Lalancette <clalance>
QA Contact: Virtualization Bugs <virt-bugs>
CC: clalance, mjenner, xen-maint
Fixed In Version: RHEA-2007-0635
Doc Type: Bug Fix
Last Closed: 2007-11-07 17:11:31 UTC
Attachments:
   Simple patch to fix the crash (flags: none)
   A different patch to the privcmd stuff (flags: none)

Description Jan Mark Holzer 2007-08-20 17:48:00 UTC
Description of problem:

Initiating a live migration (or offline migration) on RHEL5.1 (-40 kernel)
with an HVM/fully-virt RHEL and/or Windows guest causes the target host/dom0
to panic without any dump or console message.


Version-Release number of selected component (if applicable):
Linux woodie.lab.boston.redhat.com 2.6.18-40.el5xen #1 SMP Tue Aug 14 18:12:49
EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

on both hosts (woodie/buzz)


How reproducible:

Select an HVM guest for migration (live or offline) and watch the target host panic.

Steps to Reproduce:
1. Start a HVM guest on your source dom0
2. Initiate a live or offline migration with a command similar to the
   following:
   # xm migrate --live Windows2003 buzz
   or for offline
   # xm migrate rhel3u9fvi386 buzz

3. The migration command will appear to complete and give no indication that
   the target host panicked
4. Ping the target host: no response

Migrating para-virt guests works as expected
  
Actual results:

The new target host panics

Expected results:

The guest migrates successfully to the new target host

Additional info:

Last messages from the target host just before it panics.  The guest has been
started and is reloading its memory pages, and then "boom":

[2007-08-20 13:40:52 xend.XendDomainInfo 3893] DEBUG (XendDomainInfo:791)
Storing domain details: {'console/port': '3', 'name': 'rhel3u9fvi386',
'console/limit': '1048576', 'vm': '/vm/c2921aca-4c0d-b169-35a4-7e0cf688eb17',
'domid': '1', 'cpu/0/availability': 'online', 'memory/target': '1048576',
'store/port': '2'}
[2007-08-20 13:40:52 xend 3893] INFO (XendCheckpoint:180) restore hvm domain 1,
apic=0, pae=0
[2007-08-20 13:40:52 xend 3893] DEBUG (XendCheckpoint:190) restore:shadow=0x9,
_static_max=0x400, _static_min=0x400, 
[2007-08-20 13:40:52 xend 3893] DEBUG (balloon:133) Balloon: 129820 KiB free; 0
to scrub; need 1057792; retries: 20.
[2007-08-20 13:40:52 xend 3893] DEBUG (balloon:148) Balloon: setting dom0 target
to 7025 MiB.
[2007-08-20 13:40:52 xend.XendDomainInfo 3893] DEBUG (XendDomainInfo:1078)
Setting memory target of domain Domain-0 (0) to 7025 MiB.
[2007-08-20 13:40:54 xend 3893] DEBUG (balloon:127) Balloon: 1058588 KiB free;
need 1057792; done.
[2007-08-20 13:40:54 xend 3893] DEBUG (XendCheckpoint:202) [xc_restore]:
/usr/lib64/xen/bin/xc_restore 15 1 2 3 1 0 0
[2007-08-20 13:40:54 xend 3893] INFO (XendCheckpoint:338) xc_domain_restore
start: p2m_size = 100000
[2007-08-20 13:40:54 xend 3893] INFO (XendCheckpoint:338) Reloading memory
pages:   0%
[2007-08-20 13:41:09 xend 3893] INFO (XendCheckpoint:338) Received all pages (0
races)
[2007-08-20 13:41:09 xend 3893] DEBUG (XendCheckpoint:309) store-mfn 262142
[2007-08-20 13:41:09 xend 3893] INFO (XendCheckpoint:338) Restore exit with rc=0

All logfiles are available on request from woodie/buzz.
The problem is reliably reproducible locally.

Comment 1 Chris Lalancette 2007-08-20 19:12:12 UTC
Ah, OK.  Here we go.  I hooked up a serial console and did:

# xm migrate rhel4u3fv <remote>

and got the following stack trace out of the remote side:

(XEN) save.c:170:d0 HVM restore: saved CPUID (0x100f20) does not match host
(0x40f12).
(XEN) save.c:176:d0 HVM restore: Xen changeset was not saved.
(XEN) lapic_load to rearm the actimer:bus cycle is 10ns, saved tmict count 6250,
period 1000000ns, irq=239
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/memory.c:2290
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/misc/evtchn/dev
CPU 0 
Modules linked in: tun xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE
iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp
iptable_filter ip_tables x_tables bridge autofs4 hidp nfs lockd fscache nfs_acl
rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_multipath dm_mod video sbs
backlight i2c_ec button battery asus_acpi ac parport_pc lp parport snd_hda_intel
snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm floppy snd_timer snd sg
soundcore snd_page_alloc pcspkr forcedeth shpchp serial_core i2c_nforce2
i2c_core serio_raw ide_cd cdrom k8_edac edac_mc k8temp hwmon sata_nv libata
mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd
ohci_hcd uhci_hcd
Pid: 4214, comm: qemu-dm Not tainted 2.6.18-41.el5xen #1
RIP: e030:[<ffffffff80208b30>]  [<ffffffff80208b30>] __handle_mm_fault+0x379/0xf46
RSP: e02b:ffff88004a73dde8  EFLAGS: 00010202
RAX: ffffffff80514840 RBX: 0000000000000000 RCX: 00003ffffffff000
RDX: 00000000496d4000 RSI: 0000000000000067 RDI: ffff880074412040
RBP: ffff880074412040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
R13: ffff8800496d4000 R14: 00002aaaac800000 R15: ffff880072bdf870
FS:  00002aaaab11e900(0000) GS:ffffffff80599000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process qemu-dm (pid: 4214, threadinfo ffff88004a73c000, task ffff8800720b10c0)
Stack:  00000000fffffff2  0000000180275e4c  ffff880074412040  ffff88004a952b20 
 0000000000001000  00002aaaace1b000  ffff880074412040  ffffffff80261889 
 ffff8800744120a8  ffffffff80221f0c 
Call Trace:
 [<ffffffff80261889>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80221f0c>] __up_read+0x19/0x7f
 [<ffffffff802641db>] do_page_fault+0xe48/0x11dc
 [<ffffffff8030b0dd>] file_has_perm+0x94/0xa3
 [<ffffffff8025d823>] error_exit+0x0/0x6e


Code: 0f 0b 68 ce 50 47 80 c2 f2 08 49 8b 87 90 00 00 00 48 c7 44 
RIP  [<ffffffff80208b30>] __handle_mm_fault+0x379/0xf46
 RSP <ffff88004a73dde8>
 <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

This is the same bug we had in an earlier BZ, namely crashing because we reached
do_no_page with VM_PFNMAP.  I thought it had been fixed by the 5.1 stuff, but
apparently not.  I'm going to try to fix this now.

Chris Lalancette

Comment 2 Chris Lalancette 2007-08-21 13:31:53 UTC
Created attachment 161969 [details]
Simple patch to fix the crash

This is a simple patch to remove the VM_PFNMAP flag from the privcmd mmap
pages; this matches upstream Xen, and it also seems to fix the HVM live
migration crash.
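
The attachment itself is not inlined here.  Roughly, the change is of this
shape (an illustrative sketch only, not the attached patch; the exact flag
combination in the RHEL privcmd_mmap() is an assumption):

    /* Sketch, not the attached patch: privcmd_mmap() simply stops setting
     * VM_PFNMAP on the vma, matching upstream Xen. */
    static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
    {
            /* was: vma->vm_flags |= VM_RESERVED | VM_IO | VM_PFNMAP | VM_DONTCOPY; */
            vma->vm_flags |= VM_RESERVED | VM_IO | VM_DONTCOPY;
            vma->vm_ops = &privcmd_vm_ops;
            vma->vm_private_data = NULL;
            return 0;
    }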

Chris Lalancette

Comment 3 Chris Lalancette 2007-08-21 21:35:18 UTC
Today's update:

There are two bugs that have similar signatures:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253479
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249409

So they might also be fixed by fixing this bug.

After some additional testing, I found that this same bug can happen when live
migrating HVM domains from i386->i386.  I've also found that although the above
patch prevents the dom0 crash on x86_64, it doesn't actually allow guests to
work.  Additionally, the above patch makes it so that *no* guests will even boot
on i686, so something else is going on here.

Talking to Rik, it seems that VM_PFNMAP is used for VMAs that don't have a
struct page associated with them.  However, VMAs that use VM_PFNMAP should
*not* specify a nopage handler, since that can cause the crash
(handle_pte_fault calls do_no_page iff the vma has a non-NULL nopage handler).
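
For reference, the 2.6.18-era dispatch looks roughly like this (paraphrased
from memory for illustration; not quoted from the RHEL tree):

    /* Inside handle_pte_fault(), trimmed down: */
    if (!pte_present(entry)) {
            if (pte_none(entry)) {
                    if (!vma->vm_ops || !vma->vm_ops->nopage)
                            return do_anonymous_page(mm, vma, address,
                                            pte, pmd, write_access);
                    /* non-NULL nopage handler -> do_no_page() ... */
                    return do_no_page(mm, vma, address,
                                    pte, pmd, write_access);
            }
            /* swap/file-backed cases omitted */
    }

    /* ... and do_no_page() refuses to run on a PFNMAP vma; this is the
     * "kernel BUG at mm/memory.c" seen in the dom0 panic above: */
    BUG_ON(vma->vm_flags & VM_PFNMAP);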

Rik thinks that the entire reason we are hitting the nopage stuff, though, is
that the tools have some sort of off-by-one error that is causing the page
fault.  The questions to answer are:

1)  What is causing that additional page fault?
2)  Why does upstream Xen not have VM_PFNMAP, and how do they get away with not
needing it?  Does it have the same problem, and we are just missing something,
or is it broken as well?

Chris Lalancette

Comment 4 Chris Lalancette 2007-08-22 17:45:12 UTC
Created attachment 162501 [details]
A different patch to the privcmd stuff

This is another patch attempting to solve this issue.  As pointed out by Rik,
when we use VM_PFNMAP we should not have a nopage handler, otherwise
handle_pte_fault() causes us to die.  This patch removes the nopage handler
from privcmd entirely (a rough sketch of the idea follows the checklist
below).  It has so far been tested to make x86_64 off-line and live
migrations work successfully.  I was also able to start i686 PV and HVM
guests with this patch applied; I still need to test that migration works on
that arch.  So, to confirm that this patch fixes things, I need to:

1)  Make sure that I can start, save, restore, off-line migrate, live migrate
PV and HVM guests on x86_64
2)  Make sure that I can start, save, restore, off-line migrate, live migrate
PV and HVM guests on i686
3)  Make sure that I can start, save, restore, off-line migrate, live migrate
PV and HVM guests on ia64
4)  Make sure that this patch fixes the other, related BZ's
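
For illustration, the shape of the change (a sketch of the idea, not the
attached patch; the nopage handler shown is assumed from the upstream Xen
privcmd driver):

    /* Sketch only.  privcmd previously installed a nopage handler along
     * these lines: */
    static struct page *privcmd_nopage(struct vm_area_struct *vma,
                                       unsigned long address, int *type)
    {
            return NOPAGE_SIGBUS;
    }

    /* Before: ->nopage is non-NULL, so a fault on the VM_PFNMAP vma gets
     * routed into do_no_page() and hits the BUG_ON() described above. */
    static struct vm_operations_struct privcmd_vm_ops = {
            .nopage = privcmd_nopage
    };

    /* After: no nopage handler at all, so the fault can no longer reach
     * do_no_page(). */
    static struct vm_operations_struct privcmd_vm_ops = { };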

Chris Lalancette

Comment 5 Chris Lalancette 2007-08-23 22:40:06 UTC
Testing results:

1)  x86_64: Successfully started PV and HVM guests.  Successfully saved and
restored PV and HVM guests.  Successfully off-line migrated PV and HVM guests.
Successfully live migrated PV and HVM guests.

2)  i686: Successfully started PV and HVM guests.  Successfully saved and
restored PV and HVM guests.  Successfully off-line migrated PV and HVM guests.
Successfully live migrated PV and HVM guests.

3)  ??  I don't have the hardware to test.

4)  Succeeds.

However, despite this testing, there is still a problem; that will be enumerated
in a second post.

Chris Lalancette

Comment 6 Chris Lalancette 2007-08-23 22:50:06 UTC
So, despite the successful testing with the above patch, there is still a
problem.  I'll try to explain what the problem is, and why the above patch is
needed as *part* of the solution.

Note that the term "local" will mean the machine the live migrate is initiated
from, and "remote" will mean the machine the live migrate will end up on.

1)  When an HVM live migrate is started, the local machine sends over
information about how much memory the new domain will need on the remote
machine.  The python tools on the remote machine dutifully take that value and
balloon down dom0, freeing up memory for the domain.

2)  Next, the tools on the remote machine allocate the memory for that domain.
On a machine where dom0 was using all available memory before the migrate
started, this means that after this allocation there will be *precisely* 0
additional pages for the hypervisor to hand out to domains (note that it keeps
some memory for itself, but that is not relevant here).

3)  However, when qemu starts up on the remote side, it needs a few additional
pages for the device emulation.  In particular, the Cirrus VGA does a
populate_physmap followed by an xc_map_foreign_pages.

It is this 3rd step that causes all of the issues.  Since the HV has no more
memory to hand out to domains, it actually fails the populate_physmap.
However, QEMU does *not* check the return code of the populate_physmap, so it
blindly goes ahead and does an xc_map_foreign_pages on pages that were never
successfully populated.  This causes the page fault in dom0, which leads to
the do_no_page() call and the BUG_ON().

So the fix here is in multiple parts (a rough sketch of the qemu-dm side of
2) and 3) follows the list):
1)  Since userland should not be able to crash dom0, regardless of whether it
is doing the wrong thing (as stated in BZ 249409), I believe we need the patch
that is already attached to this BZ.
2)  QEMU *should* check whether populate_physmap failed, and take appropriate
action.  In this case, if it fails because the hypervisor is out of memory,
the migration will still succeed; it just won't have emulated video available.
3)  QEMU should attempt to balloon dom0 down by the number of additional pages
it needs to succeed.
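
A minimal sketch of the qemu-dm side of 2) and 3); the libxc calls are the
ones named above, but their exact signatures, the surrounding variables, and
this error handling are assumptions, not the eventual patch:

    /* Sketch of fixes 2) and 3); nr_pages and extents stand in for whatever
     * the Cirrus VGA code really uses. */
    if (xc_domain_memory_populate_physmap(xc_handle, domid, nr_pages,
                                          0 /* order */, 0 /* addr_bits */,
                                          extents) != 0) {
            /* 3) would go here: ask to balloon dom0 down by nr_pages and
             * retry the populate before giving up. */

            /* 2) populate failed (e.g. the hypervisor has no free pages):
             * log it and run without emulated video rather than blindly
             * mapping pages that were never populated. */
            fprintf(stderr, "populate_physmap for VGA RAM failed\n");
            return NULL;
    }
    /* Only map the pages once we know they were actually populated. */
    vram = xc_map_foreign_pages(xc_handle, domid, PROT_READ | PROT_WRITE,
                                extents, nr_pages);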

With the above three fixes in place, I am able not only to prevent the crash
but also to complete the migration.  Note that I am still having problems with
video, but I believe that is a secondary bug.

Tomorrow I will work on making better versions of items 2) and 3), and pushing
those upstream and internally.

Chris Lalancette

Comment 7 Chris Lalancette 2007-08-24 15:06:25 UTC
Comment on attachment 162501 [details]
A different patch to the privcmd stuff

I'm tracking the kernel side of this problem in
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249409
so this patch is stale now.

Chris Lalancette

Comment 10 Daniel Berrangé 2007-09-04 18:25:18 UTC
Fix built into:

* Fri Aug 31 2007 Daniel P. Berrange <berrange> - 3.0.3-38.el5
- Fixed memory ballooning for HVM restore (rhbz #253583)


Comment 13 errata-xmlrpc 2007-11-07 17:11:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2007-0635.html