Bug 787997

Summary: Guest will hang up when restore from a saved img by using virt-manager
Product: Red Hat Enterprise Linux 5
Reporter: zhe peng <zpeng>
Component: kvm
Assignee: Amit Shah <amit.shah>
Status: CLOSED WONTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Priority: medium
Version: 5.8
CC: asias, bgollahe, bsarathy, chayang, dallan, dyuan, juzhang, jwu, michen, mkenneth, mzhan, rhod, rwu, virt-maint
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-07-23 11:13:21 UTC
Bug Blocks: 807971
Attachments (Description: Flags):
libvirt.log: none
virt-manager log: none
/var/log/message: none
guest xml file: none

Description zhe peng 2012-02-07 07:58:03 UTC
Description of problem:
The guest hangs when it is restored from a saved image using virt-manager.

Version-Release number of selected component (if applicable):
RHEL5.8 RC3
libvirt-0.8.2-25.el5
virt-manager-0.6.1-16.el5
python-virtinst-0.400.3-13.el5

How reproducible:
50%

Steps to Reproduce:
1. Start virt-manager.

2. Install a rhel5.8rc3 guest.

3. After the installation succeeds, use virt-manager to open the guest, log in to the guest, perform some operations, then save the guest.

4. When the save finishes, restore the guest from the saved file.
These steps do not reproduce the issue every time.

Another way to reproduce this:

1. Start virt-manager.
2. Install a rhel5.8rc3 guest.
3. After the installation succeeds, run the guest.
4. Click the Hardware tab.
5. Click Add Hardware button
6. Select Storage in Hardware type droplist and click Forward.
7. Select managed or other existing storage and fill out the Location of  Block device(partition), eg: /dev/sda3.
8. Select vitual disk in Device type then click Forward.
9. Click Finish button.
10. Save the guest.
11. When the save finishes, restore the guest from the saved file.

Actual results:
The mouse and keyboard do not work; the guest hangs.

Expected results:
The user can operate the guest normally.

Additional info:
If the guest is left running and the host is rebooted, the guest also sometimes hangs once the host has started.
If the guest is manually forced off and started again, everything works well.

Comment 1 zhe peng 2012-02-07 08:00:36 UTC
Created attachment 559862 [details]
libvirt.log

Comment 2 zhe peng 2012-02-07 08:01:23 UTC
Created attachment 559863 [details]
virt-manager log

Comment 3 Cole Robinson 2012-02-07 14:19:09 UTC
I'm assuming this is a libvirt issue, since virt-manager isn't doing much except calling libvirt APIs. Reassigning

Please verify that you can reproduce using 'virsh save' and 'virsh restore' (or managed-save and start, if that's what virt-manager supports. If virt-manager is asking you for a path to save the file, you need to use 'virsh save').

Comment 4 zhe peng 2012-02-08 05:18:43 UTC
I can reproduce this issue using 'virsh save' and 'virsh restore'.
Steps:
 1. # virsh start rhel5.8rc3
 2. Log in to the guest and run 'modprobe acpiphp'.
 3. On the host, run:
    # virsh attach-disk rhel5.8rc3 /dev/sda2 --target vdb --driver qemu
    Disk attached successfully
    Log in to the guest; the vdb hotplug succeeded.
 4. # virsh save rhel5.8rc3 /tmp/rhel5.save
    Domain rhel5.8rc3 saved to /tmp/rhel5.save
 5. # virsh restore /tmp/rhel5.save
    Domain restored from /tmp/rhel5.save
 6. # virt-viewer rhel5.8rc3
    In the guest, the mouse and keyboard do not work.
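
The host-side part of these steps could be scripted roughly as below. This is only a sketch: the helper names (repro_commands, run_all) are made up for illustration, the virsh arguments are copied from this comment, and the in-guest 'modprobe acpiphp' step still has to be done by hand between the start and the attach.

```python
import subprocess


def repro_commands(domain, disk, target, save_path):
    """Return the host-side virsh commands from the steps above, in order.

    Hypothetical helper: arguments mirror the commands quoted in this comment.
    """
    return [
        ["virsh", "start", domain],
        # 'modprobe acpiphp' must be run inside the guest before this step.
        ["virsh", "attach-disk", domain, disk,
         "--target", target, "--driver", "qemu"],
        ["virsh", "save", domain, save_path],
        ["virsh", "restore", save_path],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("# " + " ".join(argv))
        if not dry_run:
            # Stops at the first failing command.
            subprocess.run(argv, check=True)
```

Calling run_all(repro_commands("rhel5.8rc3", "/dev/sda2", "vdb", "/tmp/rhel5.save")) only prints the four commands. A real run would need a pause after 'virsh start' so that the guest can boot and load acpiphp.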

Comment 5 zhe peng 2012-02-08 05:21:52 UTC
Created attachment 560132 [details]
/var/log/message

Comment 6 zhe peng 2012-02-08 05:23:25 UTC
Created attachment 560133 [details]
guest xml file

Comment 7 Dave Allan 2012-02-08 16:39:20 UTC
Is the guest actually responsive, or is the guest OS hung?

Comment 8 zhe peng 2012-02-09 02:47:48 UTC
I think the guest OS is hung: I can't ssh to the guest after the restore, although ssh worked well before the save.
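
A scripted version of this ssh check might look like the sketch below. It assumes the guest's address is known and that sshd listens on port 22; the function name guest_answers is hypothetical. A successful TCP connect only shows that the guest network stack still responds, so it is a rough indicator, not proof that the guest OS is healthy.

```python
import socket


def guest_answers(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the guest as unresponsive.
        return False
```

Running the check once before 'virsh save' and once after 'virsh restore' gives a simple before/after comparison.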

Comment 9 RHEL Program Management 2012-03-30 14:17:18 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 11 Gunannan Ren 2012-06-27 12:56:32 UTC
I couldn't reproduce the problem on my test machine. I will try to reproduce it on reporter's machine.

Comment 12 Gunannan Ren 2012-06-30 13:06:15 UTC
The problem can be reproduced 100 percent of the time in this situation:
1. The guest is the RHEL5u8 release.
2. The host is the RHEL5u8 release.
3. Before running "virsh save", we attach a hotpluggable virtio disk to the guest. (For a RHEL5 guest, the acpiphp kernel module must first be loaded with modprobe acpiphp in the guest for the attach to succeed.)
4. Save the guest to a file.
5. Run virsh restore <file.save> to restore the guest.
After the above 5 steps, the guest hangs forever.

The problem is not encountered in any of the following situations:
The guest is RHEL6 or above.
The guest is RHEL5u8 without a disk attached before saving.
The host is RHEL6 or above rather than RHEL5u8.

During a VM restore, libvirt first forks a subprocess to run the qemu-kvm command line with stdin set to the file descriptor of the opened file.save file. The parent process then does other work, including setting the balloon memory via the qemu monitor. In the case of this bug, after libvirt sends 'balloon 1024' to the qemu monitor, the guest hangs forever; even though libvirt later sends 'cont' to start the qemu process, the VM stays hung.

If we add a usleep() so that the parent process sleeps for a while just before setting the balloon memory, the restore succeeds. So I think this is a bug in qemu-kvm on RHEL5u8. Probably a race.

I used the newest RHEL5 kvm version on brew, kvm-83-254.el5, and the issue still exists.


How to reproduce by hand using the qemu command line:

1. Start a guest using the following command line.

/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name rhel5u8 -monitor unix:/var/lib/libvirt/qemu/rhel5u8.monitor,server,nowait -no-kvm-pit-reinjection -boot c -drive file=/var/lib/libvirt/images/rhel5u8.img,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -balloon virtio

2. Run 'modprobe acpiphp' in the rhel5u8 guest. Then hotplug a virtio disk via the qemu monitor console.

pci_add pci_addr=auto storage file=/var/lib/libvirt/images/attachdisk,if=virtio

3. Save the guest to a file via the qemu monitor console.

migrate exec:cat>/tmp/rhel5u8.save.hand

4. Kill the previous qemu process, then start a new qemu process with the following command line.

 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name rhel5u8 -monitor unix:/var/lib/libvirt/qemu/rhel5u8.monitor,server,nowait -no-kvm-pit-reinjection -boot c -drive file=/var/lib/libvirt/images/rhel5u8.img,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none -drive file=/var/lib/libvirt/images/attachdisk,if=virtio,format=raw -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -incoming "exec:cat /tmp/rhel5u8.save.hand" -balloon virtio

5. Quickly connect to the qemu monitor console and run "balloon 1024".
nc -U /var/lib/libvirt/qemu/rhel5u8.monitor
(qemu) balloon 1024

Then the guest hangs forever; even if we send 'cont' to qemu later, it no longer works.
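
Because step 5 is timing-sensitive, the nc session could be replaced by a small script. The following is a sketch under assumptions: it treats the monitor as a plain-text stream on the unix socket given with -monitor above, and it assumes each reply ends with the "(qemu) " prompt. The function names are hypothetical.

```python
import socket

PROMPT = b"(qemu) "


def read_until_prompt(sock):
    """Read from the monitor until the '(qemu) ' prompt appears."""
    data = b""
    while not data.endswith(PROMPT):
        chunk = sock.recv(4096)
        if not chunk:  # the monitor closed the connection
            break
        data += chunk
    return data.decode(errors="replace")


def monitor_command(sock, command):
    """Send one monitor command; return everything printed up to the next prompt."""
    sock.sendall(command.encode() + b"\n")
    return read_until_prompt(sock)


def connect_monitor(path):
    """Connect to the monitor's unix socket and consume the banner and first prompt."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    read_until_prompt(sock)
    return sock
```

For example, sock = connect_monitor("/var/lib/libvirt/qemu/rhel5u8.monitor") followed by monitor_command(sock, "balloon 1024") and later monitor_command(sock, "cont") would mirror the manual session above.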

Comment 13 Gunannan Ren 2012-07-02 14:37:28 UTC
This problem doesn't exist with the RHEL6 kvm version.
According to the reproduction procedure via the qemu-kvm command line described in comment 12, it seems to be a bug in qemu-kvm that may be fixed in RHEL6.
So I am changing the component to qemu-kvm for help.

Comment 14 Chao Yang 2012-07-13 11:27:58 UTC
Tested more on kvm-83-256.el5, 2.6.18-322.el5. Here is what I got:
1. With '-S' and '-balloon virtio' on the command line, I can reproduce this issue with the same steps as in comment #12. If I send only 'cont' to the monitor, without 'balloon', it is not reproducible.
2. With '-S' and '-balloon none' on the command line, it is reproducible with the same steps as in comment #12.

3. Without '-S', with '-balloon virtio' on the command line, I only get a call trace in the guest when issuing 'balloon' to the monitor.
4. Without '-S', with '-balloon virtio' on the command line, the guest works fine if 'balloon' is not issued to the monitor.
5. Without '-S', with '-balloon none' on the command line, the guest restores fine with or without issuing 'balloon'.

Anyway, after a hot plug/unplug followed by migrating the guest, 'info pci' in the monitor shows different output on the source and the destination. Also, from bz652146#c4: "In RHEL5.x you can't do hot-plug/unplug and then migrate. It is known not to work."

Comment 15 Chao Yang 2012-07-13 11:31:32 UTC
SRC host:
--------
(qemu) pci_add pci_addr=auto storage file=/root/test.img,if=virtio
pci_add pci_addr=auto storage file=/root/test.img,if=virtio
OK domain 0, bus 0, slot 5, function 0
(qemu) info pci
info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
  Bus  0, device   1, function 1:
    IDE controller: PCI device 8086:7010
      BAR4: I/O at 0xc000 [0xc00f].
  Bus  0, device   1, function 2:
    USB controller: PCI device 8086:7020
      IRQ 11.
      BAR4: I/O at 0xc020 [0xc03f].
  Bus  0, device   1, function 3:
    Bridge: PCI device 8086:7113
      IRQ 9.
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1013:00b8
      BAR0: 32 bit memory at 0xc2000000 [0xc3ffffff].
      BAR1: 32 bit memory at 0xc4000000 [0xc4000fff].
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 10ec:8139
      IRQ 11.
      BAR0: I/O at 0xc100 [0xc1ff].
      BAR1: 32 bit memory at 0xc4001000 [0xc40010ff].
  Bus  0, device   4, function 0:
    RAM controller: PCI device 1af4:1002
      IRQ 11.
      BAR0: I/O at 0xc200 [0xc21f].
  Bus  0, device   5, function 0:
    SCSI controller: PCI device 1af4:1001
      IRQ 0.
      BAR0: I/O at 0x1000 [0x103f].

DST host:
--------
(qemu) info pci
info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
  Bus  0, device   1, function 1:
    IDE controller: PCI device 8086:7010
      BAR4: I/O at 0xc000 [0xc00f].
  Bus  0, device   1, function 2:
    USB controller: PCI device 8086:7020
      IRQ 11.
      BAR4: I/O at 0xc020 [0xc03f].
  Bus  0, device   1, function 3:
    Bridge: PCI device 8086:7113
      IRQ 9.
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1013:00b8
      BAR0: 32 bit memory at 0xc2000000 [0xc3ffffff].
      BAR1: 32 bit memory at 0xc4000000 [0xc4000fff].
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 10ec:8139
      IRQ 11.
      BAR0: I/O at 0xc100 [0xc1ff].
      BAR1: 32 bit memory at 0xc4001000 [0xc40010ff].
  Bus  0, device   4, function 0:
    SCSI controller: PCI device 1af4:1001
      IRQ 0.
      BAR0: I/O at 0x1000 [0x103f].
  Bus  0, device   5, function 0:
    RAM controller: PCI device 1af4:1002
      IRQ 11.
      BAR0: I/O at 0xc200 [0xc21f].
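
To compare the two listings mechanically, something like the following could be used. It is a sketch that assumes the 'info pci' text format shown above; the function names are hypothetical.

```python
import re

# Matches lines such as "  Bus  0, device   4, function 0:"
ADDRESS_RE = re.compile(r"Bus\s+(\d+),\s+device\s+(\d+),\s+function\s+(\d+):")
# Matches the vendor:device pair in "RAM controller: PCI device 1af4:1002"
ID_RE = re.compile(r"PCI device ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})")


def parse_info_pci(text):
    """Map (bus, device, function) to 'vendor:device' from 'info pci' output."""
    layout = {}
    address = None
    for line in text.splitlines():
        match = ADDRESS_RE.search(line)
        if match:
            address = tuple(int(group) for group in match.groups())
            continue
        match = ID_RE.search(line)
        if match and address is not None:
            layout[address] = match.group(1)
            address = None
    return layout


def moved_devices(src, dst):
    """Return the addresses whose device ID differs between source and destination."""
    return {
        addr: (src.get(addr), dst.get(addr))
        for addr in sorted(set(src) | set(dst))
        if src.get(addr) != dst.get(addr)
    }
```

Applied to the SRC and DST listings in this comment, it would report that slots 4 and 5 are swapped: 1af4:1002 (RAM controller, the virtio balloon) is at slot 4 on the source and slot 5 on the destination, and 1af4:1001 (SCSI controller, the hotplugged virtio disk) is the other way around.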

Comment 16 Amit Shah 2012-07-13 12:08:16 UTC
Please paste guest kernel output from serial console when the guest hangs up.

Are there any panic/oops messages?

From comment 14: what is the call trace you get in the guest?

Comment 17 Chao Yang 2012-07-13 13:46:31 UTC
(In reply to comment #16)
> Please paste guest kernel output from serial console when the guest hangs up.
> 
I don't see any output on the serial port when the guest hangs.
> Are there any panic/oops messages?
> 
> From comment 14: what is the call trace you get in the guest?

irq 10: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ>  [<ffffffff800be9d2>] __report_bad_irq+0x30/0x7d
 [<ffffffff800bec10>] note_interrupt+0x1f1/0x232
 [<ffffffff800be110>] __do_IRQ+0x114/0x15b
 [<ffffffff8006d4d1>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff80012519>] __do_softirq+0x51/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d646>] do_softirq+0x2c/0x7d
 [<ffffffff8006d4d6>] do_IRQ+0xee/0xf7
 [<ffffffff8006be03>] default_idle+0x0/0x50
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8006be2c>] default_idle+0x29/0x50
 [<ffffffff80048f92>] cpu_idle+0x95/0xb8
 [<ffffffff8046d809>] start_kernel+0x220/0x225
 [<ffffffff8046d22f>] _sinittext+0x22f/0x236

handlers:
[<ffffffff8819b5a0>] (cp_interrupt+0x0/0x360 [8139cp])
[<ffffffff8824515f>] (vp_interrupt+0x0/0xc1 [virtio_pci])
Disabling IRQ #10

Comment 18 Amit Shah 2012-07-19 07:14:36 UTC
OK - so this happens only if a disk is hotplugged before saving and restoring the guest.

Saving and restoring the guest using migrate-to-file is the same as migrating the guest to a different host.

This scenario isn't supposed to work, as also mentioned in 652146.

The only suspicious thing could be that a RHEL6 guest works fine on a RHEL5 host with disk hot-plug/unplug.  Can you confirm whether this works fine or not?  If it does, we may have to look at the problem differently.  Please try a few times, since if it is indeed a race, it may not trigger immediately.

Comment 19 Amit Shah 2012-07-19 07:18:16 UTC
(In reply to comment #17)
> > From comment 14: what is the call trace you get in the guest?
> 
> irq 10: nobody cared (try booting with the "irqpoll" option)
> 
> Call Trace:
>  <IRQ>  [<ffffffff800be9d2>] __report_bad_irq+0x30/0x7d

I think what's happening is that the devices may not be on the exact same pci address before and after save/resume.  This obviously will confuse the guest.  Since with RHEL5 we don't have a mechanism to place devices on specific addresses, this may very well end up not being a supported workflow at all.

Comment 20 Ronen Hod 2012-07-23 11:13:21 UTC
Closing,
The infrastructure change is too big for RHEL5.