Bug 604087
Summary: | QEMU cannot save/restore successfully after device hotplug | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Dmitry Konishchev <konishchev> |
Component: | libvirt | Assignee: | Daniel Veillard <veillard> |
Status: | CLOSED CANTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | 5.5 | CC: | berrange, dallan, dyuan, jdenemar, jialiu, llim, virt-maint, xen-maint |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | libvirt-0.8.2-1.el5 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2010-11-22 12:54:55 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Dmitry Konishchev
2010-06-15 11:57:16 UTC
Fixed in libvirt-0.8.2-1.el5 Re-test this bug with libvirt-0.8.2-7.el5 on RHEL5.6-Server-20100930.0 x86_64 KVM host, still failed. libvirt-0.8.2-7.el5 kvm-83-200.el5 kmod-kvm-83-200.el5 attach-disk bug: 1. Run an KVM domain. 2. Save and restore the domain. # virsh save rhel5u5 /tmp/tt.save Domain rhel5u5 saved to /tmp/tt.save # virsh restore /tmp/tt.save Domain restored from /tmp/tt.save 3. In guest, load apciphp module, and try to attach a disk via virsh command # virsh attach-disk rhel5u5 /var/lib/libvirt/images/newdisk.img vda --sourcetype=file --driver=qemu --subdriver=qcow2 Disk attached successfully # virsh dumpxml rhel5u5 ... <disk type='file' device='disk'> <driver name='qemu' type='qcow2'/> <source file='/var/lib/libvirt/images/newdisk.img'/> <target dev='vda' bus='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2aaa'/> </disk> ... libvirt report no errors but the device will not appear in the guest OS. Restoring bug: 1. Run an KVM domain. In guest, load acpiphp module # modprobe acpiphp 2. Attach a disk to the domain. # virsh attach-disk rhel5u5 /var/lib/libvirt/images/newdisk.img vda --sourcetype=file --driver=qemu --subdriver=qcow2 Disk attached successfully In geust, the new disk is seen. # fdisk -l 3. Save and restore the domain. # virsh save rhel5u5 /tmp/tt.save Domain rhel5u5 saved to /tmp/tt.save # virsh restore /tmp/tt.save Domain restored from /tmp/tt.save # virsh list --all Id Name State ---------------------------------- 16 rhel5u5 running The virsh command report no any error. But the domain looks like dead, because I can not input anything in guest via virt-viewer. Log in /var/log/libvirt/qemu/rhel5u5.log: LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/ QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 1024 -smp 1 -name rhel5u5 -uuid 6c143a7c-150f-dc29-224c-940672cbea1d -monitor unix:/var/lib/libvirt/qemu/rhel5u5.monitor,server,nowait -boot c -drive file=/var/lib/libvirt/images/rhel5u5.img,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none -net nic,macaddr=54:52:00:3d:53:ca,vlan=0 -net tap,fd=16,vlan=0 -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -balloon virtio char device redirected to /dev/pts/2 0+0 records in 0+0 records out 0 bytes (0 B) copied, 1.1e-05 seconds, 0.0 kB/s 9+35704 records in 9+35704 records out 311379825 bytes (311 MB) copied, 8.99713 seconds, 34.6 MB/s LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/ QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 1024 -smp 1 -name rhel5u5 -uuid 6c143a7c-150f-dc29-224c-940672cbea1d -monitor unix:/var/lib/libvirt/qemu/rhel5u5.monitor,server,nowait -boot c -drive file=/var/lib/libvirt/images/rhel5u5.img,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none -drive file=/var/lib/libvirt/images/newdisk.img,if=virtio,format=qcow2 -net nic,macaddr=54:52:00:3d:53:ca,vlan=0 -net tap,fd=18,vlan=0 -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -incoming exec:cat -balloon virtio char device redirected to /dev/pts/2 According to comments 3, change this bug's status to "ASSIGNED" from "ON_QA" Hmm, it works just fine for me. Also as qemu log suggests, qemu was started correctly with newdisk.img attached. I'll talk to you online tomorrow afternoon. In the meantime, could you try upgrading kvm to the latest version from nightly and check if that makes any difference? save/restore after doing hotplug operations is possible, but incredibly fragile in a RHEL-5 era QEMU. The problem is that there is no support for explicit PCI device slot numbering on the command line. So when libvirt does the restore step, there is a strong chance that the QEMU process will give some of the device different PCI slots than they had before saving. Needless to say, the guest OS will get very unhappy if PCI slots change while it is running. Ah, good point. Did you only attached that one disk and saved/resumed or did you do several attach-disk/detach-disk commands before saving/restoring the guest? The letter case is very likely to hit the issue described by Daniel. It is worse than just how many you attach/detach. It varies according to the type of device being plugged. The order in which QEMU assigns addresses for devices on the command line is: * - Host bridge (slot 0) * - PIIX3 ISA bridge, IDE controller, something else unknown, USB controller (slot 1) * - Video (slot 2) * - Net * - Sound * - SCSI controllers * - VirtIO block * - VirtIO balloon * - Host device passthrough * - Watchdog So if you hotplug a NIC, then all SCSI controllers and VirtIO block devices get their PCI slow shifted down one number. And hotplugging anything shifts the virtio balloon. Hotplugging SCSI disks is fairly safe, if at least 1 SCSI disk already exists, because that is plugging disks into an existing SCSI controller thus not allocating any new PCI addrs. Pls ignore the comment 9. # uname -a Linux dhcp-66-92-15.nay.redhat.com 2.6.18-228.el5 #1 SMP Mon Oct 18 12:22:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux # rpm -qa|grep libvirt libvirt-0.8.2-8.el5 libvirt-python-0.8.2-8.el5 # rpm -qa|grep kvm etherboot-zroms-kvm-5.4.4-13.el5 kmod-kvm-83-205.el5 kvm-qemu-img-83-205.el5 kvm-83-205.el5 # getenforce Enforcing According to the original description, there are 2 problems in 2 test scenario: 1. attach-disk bug: # virsh save rhel5 /tmp/rhel5.save Domain rhel5 saved to /tmp/rhel5.save # virsh restore /tmp/rhel5.save Domain restored from /tmp/rhel5.save # virsh attach-disk rhel5 /var/lib/libvirt/images/foo.img vda --sourcetype=file --driver=qemu Disk attached successfully attach-disk bug: ##################### Attach disk after save/restore guest, libvirt report no errors but the device will not appear in the guest OS. no error in /var/log/messages and /var/log/libvirt/qemu/rhel5.log ##################### ---- Shutdown the guest manually, then start guest for the 2nd test scenario. 2. Restoring bug: issue # modprobe acpiphp in guest. # virsh attach-disk rhel5 /var/lib/libvirt/images/foo.img vda --sourcetype=file --driver=qemu Disk attached successfully issue # fdisk -l in guest, the new disk is seen. # virsh save rhel5 /tmp/rhel5-1.save Domain rhel5 saved to /tmp/rhel5-1.save # virsh restore /tmp/rhel5-1.save Domain restored from /tmp/rhel5-1.save Restoring bug: ##################### # fdisk will hang, wait a few mins, the guest will shutdown automatic. before it shutdown, I see 'Call Trace' and 'init 0' in terminal but I didn't issue this command. ##################### I start the guest manually, then get the /var/log/messages : Oct 23 00:25:00 localhost kernel: acpiphp: Slot [31] registered Oct 23 00:25:09 localhost kernel: ACPI Exception (pci_bind-0302): AE_NOT_FOUND, Unable to get data from device S5 [20060707] Oct 23 00:25:09 localhost kernel: PCI: Ignore bogus resource 6 [0:0] of 0000:00:02.0 Oct 23 00:25:09 localhost kernel: decode_hpp: Could not get hotplug parameters. Use defaults Oct 23 00:25:09 localhost kernel: PCI: Enabling device 0000:00:05.0 (0000 -> 0001) Oct 23 00:25:09 localhost kernel: ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10 Oct 23 00:25:09 localhost kernel: ACPI: PCI Interrupt 0000:00:05.0[A] -> Link [LNKA] -> GSI 10 (level, high) -> IRQ 10 Oct 23 00:25:10 localhost kernel: vda: unknown partition table Oct 23 00:26:01 localhost kernel: atkbd.c: Unknown key released (translated set 2, code 0x0 on isa0060/serio0). Oct 23 00:26:01 localhost kernel: atkbd.c: Use 'setkeycodes 00 <keycode>' to make it known. Oct 23 00:27:52 localhost kernel: atkbd.c: Unknown key released (translated set 2, code 0x0 on isa0060/serio0). Oct 23 00:27:52 localhost kernel: atkbd.c: Use 'setkeycodes 00 <keycode>' to make it known. Oct 23 00:28:24 localhost kernel: irq 11: nobody cared (try booting with the "irqpoll" option) Oct 23 00:28:24 localhost kernel: Oct 23 00:28:24 localhost kernel: Call Trace: Oct 23 00:28:24 localhost kernel: <IRQ> [<ffffffff800bc899>] __report_bad_irq+0x30/0x7d Oct 23 00:28:24 localhost kernel: [<ffffffff800bcacc>] note_interrupt+0x1e6/0x227 Oct 23 00:28:24 localhost kernel: [<ffffffff800bbfc8>] __do_IRQ+0xbd/0x103 Oct 23 00:28:24 localhost kernel: [<ffffffff8006da2b>] do_IRQ+0xe7/0xf5 Oct 23 00:28:24 localhost kernel: [<ffffffff8005e615>] ret_from_intr+0x0/0xa Oct 23 00:28:24 localhost kernel: [<ffffffff800123d1>] __do_softirq+0x51/0x133 Oct 23 00:28:24 localhost kernel: [<ffffffff8005f2fc>] call_softirq+0x1c/0x28 Oct 23 00:28:24 localhost kernel: [<ffffffff8006dba8>] do_softirq+0x2c/0x85 Oct 23 00:28:24 localhost kernel: [<ffffffff8006c360>] default_idle+0x0/0x50 Oct 23 00:28:24 localhost kernel: [<ffffffff8005ec8e>] apic_timer_interrupt+0x66/0x6c Oct 23 00:28:24 localhost kernel: <EOI> [<ffffffff8006c389>] default_idle+0x29/0x50 Oct 23 00:28:24 localhost kernel: [<ffffffff800497be>] cpu_idle+0x95/0xb8 Oct 23 00:28:24 localhost kernel: [<ffffffff80407807>] start_kernel+0x220/0x225 Oct 23 00:28:24 localhost kernel: [<ffffffff8040722f>] _sinittext+0x22f/0x236 Oct 23 00:28:24 localhost kernel: Oct 23 00:28:24 localhost kernel: handlers: Oct 23 00:28:24 localhost kernel: [<ffffffff801f819f>] (usb_hcd_irq+0x0/0x55) Oct 23 00:28:24 localhost kernel: [<ffffffff8821a05e>] (vp_interrupt+0x0/0xb8 [virtio_pci]) Oct 23 00:28:24 localhost kernel: Disabling IRQ #11 Oct 23 00:29:44 localhost kernel: INFO: task fdisk:2704 blocked for more than 120 seconds. Oct 23 00:29:44 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 23 00:29:44 localhost kernel: fdisk D 0000000000000002 0 2704 2565 (NOTLB) Oct 23 00:29:44 localhost kernel: ffff810018f03c08 0000000000000086 ffff81000623f4b0 0000000000000046 Oct 23 00:29:44 localhost kernel: 0000000000000220 0000000000000007 ffff810018e3a860 ffff8100125ae100 Oct 23 00:29:44 localhost kernel: 0000007cf5989057 0000000000083b57 ffff810018e3aa48 000000001887c000 Oct 23 00:29:44 localhost kernel: Call Trace: Oct 23 00:29:44 localhost kernel: [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90 Oct 23 00:29:44 localhost kernel: [<ffffffff8005ad5a>] getnstimeofday+0x10/0x28 Oct 23 00:29:44 localhost kernel: [<ffffffff80028adc>] sync_page+0x0/0x43 Oct 23 00:29:44 localhost kernel: [<ffffffff800647ea>] io_schedule+0x3f/0x67 Oct 23 00:29:44 localhost kernel: [<ffffffff80028b1a>] sync_page+0x3e/0x43 Oct 23 00:29:44 localhost kernel: [<ffffffff8006492e>] __wait_on_bit_lock+0x36/0x66 Oct 23 00:29:44 localhost kernel: [<ffffffff8003ff92>] __lock_page+0x5e/0x64 Oct 23 00:29:44 localhost kernel: [<ffffffff800a1bd2>] wake_bit_function+0x0/0x23 Oct 23 00:29:44 localhost kernel: [<ffffffff8000c2e7>] do_generic_mapping_read+0x1df/0x354 Oct 23 00:29:44 localhost kernel: [<ffffffff8000d0fb>] file_read_actor+0x0/0x159 Oct 23 00:29:44 localhost kernel: [<ffffffff8000c5a8>] __generic_file_aio_read+0x14c/0x198 Oct 23 00:29:44 localhost kernel: [<ffffffff800c78fb>] generic_file_read+0xac/0xc5 Oct 23 00:29:44 localhost kernel: [<ffffffff8012b541>] inode_has_perm+0x56/0x63 Oct 23 00:29:44 localhost kernel: [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e Oct 23 00:29:44 localhost kernel: [<ffffffff80042489>] do_ioctl+0x21/0x6b Oct 23 00:29:44 localhost kernel: [<ffffffff8012e042>] selinux_file_permission+0x9f/0xb6 Oct 23 00:29:44 localhost kernel: [<ffffffff8000b6b0>] vfs_read+0xcb/0x171 Oct 23 00:29:44 localhost kernel: [<ffffffff80011c01>] sys_read+0x45/0x6e Oct 23 00:29:44 localhost kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0 Oct 23 00:29:44 localhost kernel: Oct 23 00:29:44 localhost kernel: INFO: task fdisk:2721 blocked for more than 120 seconds. Oct 23 00:29:44 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 23 00:29:44 localhost kernel: fdisk D ffff810001036420 0 2721 2705 (NOTLB) Oct 23 00:29:44 localhost kernel: ffff810018d89c08 0000000000000082 ffff810018f8d3d8 ffffffff80019bc9 Oct 23 00:29:44 localhost kernel: 000000000002007f 0000000000000008 ffff81001fc41860 ffffffff80309b60 Oct 23 00:29:44 localhost kernel: 000000854bfa21b4 000000000016df82 ffff81001fc41a48 0000000018f8d3d8 Oct 23 00:29:44 localhost kernel: Call Trace: Oct 23 00:29:44 localhost kernel: [<ffffffff80019bc9>] __getblk+0x25/0x236 Oct 23 00:29:44 localhost kernel: [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90 Oct 23 00:29:44 localhost kernel: [<ffffffff80028adc>] sync_page+0x0/0x43 Oct 23 00:29:44 localhost kernel: [<ffffffff800647ea>] io_schedule+0x3f/0x67 Oct 23 00:29:44 localhost kernel: [<ffffffff80028b1a>] sync_page+0x3e/0x43 Oct 23 00:29:44 localhost kernel: [<ffffffff8006492e>] __wait_on_bit_lock+0x36/0x66 Oct 23 00:29:44 localhost kernel: [<ffffffff8003ff92>] __lock_page+0x5e/0x64 Oct 23 00:29:44 localhost kernel: [<ffffffff800a1bd2>] wake_bit_function+0x0/0x23 Oct 23 00:29:44 localhost kernel: [<ffffffff8000c2e7>] do_generic_mapping_read+0x1df/0x354 Oct 23 00:29:44 localhost kernel: [<ffffffff8000d0fb>] file_read_actor+0x0/0x159 Oct 23 00:29:44 localhost kernel: [<ffffffff8000c5a8>] __generic_file_aio_read+0x14c/0x198 Oct 23 00:29:44 localhost kernel: [<ffffffff800c78fb>] generic_file_read+0xac/0xc5 Oct 23 00:29:44 localhost kernel: [<ffffffff8012b541>] inode_has_perm+0x56/0x63 Oct 23 00:29:44 localhost kernel: [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e Oct 23 00:29:44 localhost kernel: [<ffffffff80042489>] do_ioctl+0x21/0x6b Oct 23 00:29:44 localhost kernel: [<ffffffff8012e042>] selinux_file_permission+0x9f/0xb6 Oct 23 00:29:44 localhost kernel: [<ffffffff8000b6b0>] vfs_read+0xcb/0x171 Oct 23 00:29:44 localhost kernel: [<ffffffff80011c01>] sys_read+0x45/0x6e Oct 23 00:29:44 localhost kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0 Oct 23 00:29:44 localhost kernel: Oct 23 00:30:12 localhost kernel: atkbd.c: Unknown key pressed (translated set 2, code 0x0 on isa0060/serio0). Oct 23 00:30:12 localhost kernel: atkbd.c: Use 'setkeycodes 00 <keycode>' to make it known. Oct 23 00:30:17 localhost kernel: atkbd.c: Unknown key released (translated set 2, code 0x0 on isa0060/serio0). Oct 23 00:30:17 localhost kernel: atkbd.c: Use 'setkeycodes 00 <keycode>' to make it known. Oct 23 00:30:32 localhost kernel: atkbd.c: Unknown key released (translated set 2, code 0x0 on isa0060/serio0). Oct 23 00:30:32 localhost kernel: atkbd.c: Use 'setkeycodes 00 <keycode>' to make it known. Oct 23 00:30:47 localhost kernel: atkbd.c: Unknown key released (translated set 2, code 0x0 on isa0060/serio0). Oct 23 00:30:47 localhost kernel: atkbd.c: Use 'setkeycodes 00 <keycode>' to make it known. Oct 23 00:31:24 localhost kernel: atkbd.c: Unknown key released (translated set 2, code 0x0 on isa0060/serio0). Oct 23 00:31:24 localhost kernel: atkbd.c: Use 'setkeycodes 00 <keycode>' to make it known. Oct 23 00:31:37 localhost gconfd (root-2431): Received signal 15, shutting down cleanly Oct 23 00:31:37 localhost gconfd (root-2431): Exiting Oct 23 00:31:49 localhost smartd[2225]: smartd received signal 15: Terminated Oct 23 00:31:49 localhost smartd[2225]: smartd is exiting (exit status 0) Oct 23 00:31:51 localhost avahi-daemon[2163]: Got SIGTERM, quitting. Oct 23 00:31:51 localhost avahi-daemon[2163]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::5652:ff:fe6c:8dae. Oct 23 00:31:51 localhost avahi-daemon[2163]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.122.178. Oct 23 00:32:04 localhost xinetd[2041]: Exiting... Oct 23 00:32:10 localhost hcid[1854]: Got disconnected from the system message bus Oct 23 00:32:10 localhost nm-system-settings: disconnected from the system bus, exiting. Oct 23 00:32:10 localhost kernel: nm-system-setti[2509]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fff1bda2d58 error 14 Oct 23 00:32:11 localhost rpc.statd[1794]: Caught signal 15, un-registering and exiting. Oct 23 00:32:13 localhost restorecond: terminated Oct 23 00:32:13 localhost auditd[1669]: The audit daemon is exiting. Oct 23 00:32:13 localhost kernel: audit(1287765133.607:13): audit_pid=0 old=1669 by auid=4294967295 subj=system_u:system_r:auditd_t:s0 Oct 23 00:32:14 localhost pcscd: pcscdaemon.c:572:signal_trap() Preparing for suicide Oct 23 00:32:14 localhost pcscd: hotplug_libusb.c:376:HPRescanUsbBus() Hotplug stopped Oct 23 00:32:15 localhost pcscd: readerfactory.c:1379:RFCleanupReaders() entering cleaning function Oct 23 00:32:15 localhost pcscd: pcscdaemon.c:532:at_exit() cleaning /var/run Oct 23 00:32:15 localhost kernel: Kernel logging (proc) stopped. Oct 23 00:32:15 localhost kernel: Kernel log daemon terminating. Oct 23 00:32:15 localhost hcid[1854]: Can't open system message bus connection: Failed to connect to socket /var/run/dbus/system_bus_socket: Connection refused Oct 23 00:32:16 localhost exiting on signal 15 Oct 23 00:35:21 localhost syslogd 1.4.1: restart. Oct 23 00:35:22 localhost kernel: klogd 1.4.1, log source = /proc/kmsg started. Given the lack of required features on qemu-kvm side, we did what we could for this bug. It can work for some limited cases but can't work for others. Whether it can or cannot work depends not even on the type of device being hotplugged but also on devices that were present in a guest when it was started. We can't really do more about it. |