Description of problem: While migrating paused guests to and from file (as via "virsh save" / "virsh restore"), kvm's networking support can get stuck in a mode in which packets can be sent by the guest but not received. This has been observed with virtio_net; it is not presently known whether the issue can be reproduced with other network adapters (ie. e1000). "ip link down" and "ip link up" within the guest does not clear the issue. Removing and reinstalling the virtio_net module, a burst of packets is briefly received, then the prior state resumes. virtio-blk devices configured within the same guest continue to work without issue. On the host, the tx_overrun counter for the tap device used by the guest in question occasionally increments if the host attempts to send a sufficiently large number of packets. strace'ing qemu-kvm, no select() calls appear to be occurring with the file descriptor for the tap device in their argument list. Version-Release number of selected component (if applicable): kvm-83-105.el5_4.9 kmod-kvm-83-105.el5_4.9 kernel-2.6.18-164.2.1.el5 (kernel version is for both guest and host) How reproducible: Happens 100% of the time when restoring a saved VM state on which the issue has triggered, but occurs on a fairly low percentage of save/restore cycles. Reproduction is part of an automated QA environment; all OS provisioning and software installation steps leading up to the triggering of this bug are automated (including the points in the install process at which migrate-to-file operations occur), but even so the bug does not trigger with any reliability. Steps to Reproduce: 1. Load a VM state / disk image combo provided by the Dell MessageOne systems engineering team as exhibiting this problem -or- 1. Run save/restore cycles on virtual machines with virtio network adapters using tap-based networking until one restores in such a state as to be unable to communicate with the outside world. 2. Run tcpdump or equivalent tool on both sides 3. Observe that ARP requests from the guest are seen and responded to by the host, but that the receive counter for the guest's ethernet device does not increment. While the reproducer we have right now has a libvirt header on the ramsave file, we are happy to strip that and provide a reproducer which can be run against raw kvm without libvirt if such is preferred. Additional info: Marking "Dell Confidential" as reproduction materials may include software confidential to Dell MessageOne. This issue was not seen when using upstream qemu-kvm-0.11.0 prior to migration to RHEL5.4's virtualization infrastructure. Command line: /usr/libexec/qemu-kvm -S -M pc -m 384 -smp 1 -name fvte-140dd98a761361aea78a6b105ee018413e270738 -uuid 140dd98a-7613-61ae-a78a-6b105ee01841 -monitor unix:/var/lib/libvirt/qemu/fvte-140dd98a761361aea78a6b105ee018413e270738.monitor,server,nowait -no-reboot -boot c -drive file=/local/fvte-q/.fvte/states/vms/140dd98a761361aea78a6b105ee018413e270738/disks/da.qcow2,if=virtio,index=0,boot=on,format=qcow2,cache=none -drive file=,if=floppy,index=0 -drive file=,if=ide,media=cdrom,index=2 -net nic,macaddr=00:16:3e:c6:39:f3,vlan=0,model=virtio -net tap,fd=18,vlan=0 -serial file:/local/fvte-q/.fvte/states/vms/140dd98a761361aea78a6b105ee018413e270738/log/console.log -serial pty -parallel none -usb -vnc 127.0.0.1:0 -vga cirrus -incoming "exec:cat && { echo 'MIGRATION' 'DONE'; } >&2" Various data from qemu monitor console: (qemu) info network VLAN 0 devices: tap.0: fd=18 virtio.0: model=virtio,macaddr=00:16:3e:c6:39:f3 (qemu) info pci Bus 0, device 0, function 0: Host bridge: PCI device 8086:1237 Bus 0, device 1, function 0: ISA bridge: PCI device 8086:7000 Bus 0, device 1, function 1: IDE controller: PCI device 8086:7010 BAR4: I/O at 0xc000 [0xc00f]. Bus 0, device 1, function 2: USB controller: PCI device 8086:7020 IRQ 11. BAR4: I/O at 0xc020 [0xc03f]. Bus 0, device 1, function 3: Bridge: PCI device 8086:7113 IRQ 9. Bus 0, device 2, function 0: VGA controller: PCI device 1013:00b8 BAR0: 32 bit memory at 0xc2000000 [0xc3ffffff]. BAR1: 32 bit memory at 0xc4000000 [0xc4000fff]. Bus 0, device 3, function 0: Ethernet controller: PCI device 1af4:1000 IRQ 11. BAR0: I/O at 0xc040 [0xc05f]. Bus 0, device 4, function 0: SCSI controller: PCI device 1af4:1001 IRQ 11. BAR0: I/O at 0xc080 [0xc0bf]. Bus 0, device 5, function 0: RAM controller: PCI device 1af4:1002 IRQ 10. BAR0: I/O at 0xc0c0 [0xc0df].
Removing from "Dell Confidential". Will transmit reproducer out-of-band if necessary.
Built a debug version of kvm (changed -O2 -g in CFLAGS to -O1 -ggdb) to inspect. Comparing the VirtIONet struct from a single pair of known-good and known-bad savevm images, the following differences jump out at me. From the good image: n->vdev.isr=1 n->vdev.pci_dev.irq_state = {1,0,0,0} n->mergeable_rx_bufs=0 From the bad image: n->vdev.isr=0 n->vdev.pci_dev.irq_state = {0,0,0,0} n->mergeable_rx_bufs=1
I have a packaged reproducer, with the libvirt dependency removed, containing one known-good and one known-bad sample VM. Its run script has support for, among other things, invoking qemu-kvm through gdb. As these VMs were taken from our automated testing system, rather than built as minimal reproducers from the ground up, they're a bit larger than what an ideal minimal testcase might comprise -- the archive containing them weighs in at 741MB; with its content decompressed, 4.3GB of working space is needed. Is this likely to be of use to 'yall? If so, is there somewhere I should upload it?
Yes, it will be very helpful to get good and bad images so that we can look at them. You can upload the files to dropbox: http://kbase.redhat.com/faq/docs/DOC-2113 After you do, please provide exact filenames, MD5 or SHA1 message digest of the uploaded files.
(In reply to comment #4) > Yes, it will be very helpful to get good and bad images > so that we can look at them. You can upload the files to > dropbox: > http://kbase.redhat.com/faq/docs/DOC-2113 > > After you do, please provide exact filenames, MD5 or SHA1 message digest of the > uploaded files. rhbz531958-reproducer.pax MD5: bc5b1fc8beb1431fcc27d19d1ed7fc50 SHA1: d07029c457d4a3be36d8c85676b09208261d42fc
I downloaded the files and the hash matches. I unpacked the pax archive, but I have trouble decompressing both disk and ram images. the error I get is: Unexpected end of input $ sha1sum rhbz531958-reproducer.pax d07029c457d4a3be36d8c85676b09208261d42fc rhbz531958-reproducer.pax $ md5sum rhbz531958-reproducer.pax bc5b1fc8beb1431fcc27d19d1ed7fc50 rhbz531958-reproducer.pax $ pax -rvf rhbz531958-reproducer.pax rhbz531958-reproducer rhbz531958-reproducer/run rhbz531958-reproducer/net.setup rhbz531958-reproducer/data.known_good rhbz531958-reproducer/data.known_good/ramsave.xml rhbz531958-reproducer/data.known_good/env rhbz531958-reproducer/data.known_good/ramsave.raw.xz rhbz531958-reproducer/data.known_good/ramsave.argv rhbz531958-reproducer/data.known_good/da.qcow2.xz rhbz531958-reproducer/data.known_bad rhbz531958-reproducer/data.known_bad/ramsave.xml rhbz531958-reproducer/data.known_bad/env rhbz531958-reproducer/data.known_bad/ramsave.raw.xz rhbz531958-reproducer/data.known_bad/da.qcow2.xz rhbz531958-reproducer/README rhbz531958-reproducer/net.up pax: ustar vol 1, 16 files, 775176192 bytes read, 0 bytes written. $ cd rhbz531958-reproducer/ $ xz -k -d data.known_bad/da.qcow2.xz xz: data.known_bad/da.qcow2.xz: Unexpected end of input Am I doing the right thing? Is this a problem with the uploaded files? Thanks!
The issue is on my end -- the .xz files packaged in that pax archive were indeed corrupt. A corrected version is uploading presently. rhbz531958-reproducer.pax Len: 2528860160 (2.4G) MD5: 700723a25aec72a66ba725dd0eeace52 SHA1: f085d7ae237df765ce8a6157dba538c4b5be6d12
I think this is fixed in latest kvm: kvm-83-105.el5_4.22 Specifically, after running this command: env DATA_DIR=data.known_bad ./run -nographic I was able to ssh into guest at address 192.168.0.2 In other words, after updating kvm, it can load the image and networking works. Charles, could you confirm this please?
No info yet, so - postpone for 5.6
Since it is too late to address this issue in RHEL 5.5, it has been proposed for RHEL 5.6. Contact your support representative if you need to escalate this issue.