Bug 1447935
| Summary: | Windows guest hang after migration from rhel7.3.z->rhel7.4 with e1000e nic card | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | huiqingding <huding> |
| Component: | qemu-kvm-rhev | Assignee: | Dr. David Alan Gilbert <dgilbert> |
| Status: | CLOSED DUPLICATE | QA Contact: | huiqingding <huding> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | huding, jasowang, knoel, peterx, quintela, virt-maint |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-18 10:01:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Blocks: | 1376765 | | |
Description (huiqingding, 2017-05-04 09:26:00 UTC)
Test win2016 guest, not hit this issue.

Hi,

Please can you test some things:

a) Does this fail 7.4->7.4 but otherwise the same?
b) Does this fail 7.3->7.3 but otherwise the same?
c) On 7.3->7.4, if you use virtio-net or e1000 instead of e1000e, does it fail?
d) Please give details of your two hosts; which CPU etc.

Dave

> a) Does this fail 7.4->7.4 but otherwise the same?

Test 7.4->7.4: after migration the win8.1-32 guest works well and does not hang.

> b) Does this fail 7.3->7.3 but otherwise the same?

Test 7.3.z->7.3.z: after migration the win8.1-32 guest works well and does not hang.

> c) On 7.3->7.4 if you use virtio-net or e1000 instead of e1000e does it fail?

Test 7.3.z->7.4 with e1000 and with virtio-net: not hit this issue.

> d) Please give details of your two hosts; which CPU etc

7.3.z host:

```
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Stepping:              9
CPU MHz:               2357.156
BogoMIPS:              6785.04
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
```

7.4 host:

```
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Stepping:              9
CPU MHz:               3630.828
BogoMIPS:              6784.81
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
```

I can certainly reproduce something weird happening, although it seems a bit more odd than a straight crash. What I'm seeing is that it migrates OK and works for a while, and then mostly stops: the mouse keeps moving but time stops incrementing and you can't do anything with the GUI. If you poke it for a while it might sometimes wake up and do some more.

I'm using:

```
/usr/libexec/qemu-kvm -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -m 4096 \
  -cpu SandyBridge,check -realtime mlock=off \
  -smp 4,maxcpus=4,sockets=4,cores=1,threads=1 -nodefaults \
  -uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 -rtc base=utc,driftfix=slew \
  -boot order=c,menu=on,strict=on \
  -drive file=/home/vms/huding-win8-32.1-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
  -netdev tap,id=un,vhost=on,fd=20 \
  -device e1000e,netdev=un,id=virtio-net-pci2,mac=52:54:00:61:3f:43,bus=pci.0,addr=0x5,multifunction=off \
  -monitor stdio -vga qxl -vnc :0 20<>/dev/tap15
```

Replacing the e1000e with an e1000, it seems happy; so this looks like the same bug.
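As a side note for anyone re-running this setup: instead of typing into the `-monitor stdio` console, the migration step can be driven over QMP. The following is a minimal illustrative sketch, not part of the original report; it assumes the source qemu was additionally started with `-qmp unix:/tmp/src-qmp.sock,server,nowait` and that the destination is already listening (`-incoming tcp:0:4444`). The socket path and address are placeholders.

```python
# Hypothetical helper, not from the bug report: drives "migrate" over QMP
# and polls until the migration leaves the "active" state.
# Assumes the source qemu has: -qmp unix:/tmp/src-qmp.sock,server,nowait
import json
import socket
import time

def qmp_connect(path):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    f = s.makefile("rw")
    json.loads(f.readline())                    # consume the QMP greeting
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())                    # capabilities ack
    return f

def qmp_cmd(f, name, **args):
    f.write(json.dumps({"execute": name, "arguments": args}) + "\n")
    f.flush()
    while True:                                 # skip async events
        resp = json.loads(f.readline())
        if "return" in resp or "error" in resp:
            return resp

qmp = qmp_connect("/tmp/src-qmp.sock")          # placeholder path
qmp_cmd(qmp, "migrate", uri="tcp:127.0.0.1:4444")
while qmp_cmd(qmp, "query-migrate")["return"].get("status") == "active":
    time.sleep(1)
print(qmp_cmd(qmp, "query-migrate")["return"].get("status"))
```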
I can reproduce the symptom I see going from 2.8->2.9 upstream with the command line:

```
/opt/qemu/v2.9.0/bin/qemu-system-x86_64 -machine pc-i440fx-2.6,accel=kvm,usb=off -m 4096 \
  -cpu SandyBridge,check -realtime mlock=off \
  -smp 4,maxcpus=4,sockets=4,cores=1,threads=1 -nodefaults \
  -uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 -rtc base=utc,driftfix=slew \
  -boot order=c,menu=on,strict=on \
  -drive file=/home/vms/huding-win8-32.1-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
  -netdev tap,id=un,vhost=on,fd=20 \
  -device e1000e,netdev=un,id=virtio-net-pci2,mac=52:54:00:61:3f:43,bus=pci.0,addr=0x5,multifunction=off \
  -monitor stdio -vga qxl -vnc :1 20<>/dev/tap15 -incoming tcp:0:4444
```

A bit of bisecting: I'm running an upstream 2.9.0 destination and migrating builds after 2.8.0 into it, seeing what hangs. I'm having to revert 07bfa354772 as per bz 1434784.

```
413:v2.8.0-2168-g7ec7ae4b97 -> 414:2.9 + revert of 07bfa354772 as per bz 1434784  ! Good ?
413:v2.8.0-1677-g3a5eb5b + apic revert  ! Good ?  Test repeated, still good
413:v2.8.0-1423-g9eaaf97 + apic revert  - broken
413:v2.8.0-1502-g6959e45 + apic revert  - broken
413:v2.8.0-1598-g6b4e463 + apic revert  - broken
413:v2.8.0-1628-g6e86d90 + apic revert  - broken
413:v2.8.0-1648-g5db53e3 + apic revert  - broken
413:v2.8.0-1665-g3651c28 + apic revert  - Good!  Good again
413:v2.8.0-1649-g43ddc18 + apic revert  - broken
413:v2.8.0-1664-g1bbe5dc + apic revert  - initial response, but then hang? So bad? - try again, broken
```

So it looks like it's something between 1664 and 1665, but there really doesn't seem to be anything relevant in that range.

bz 1449490 is another, possibly related, e1000e-on-Windows migration bug, but this time 7.4->7.4: the e1000e interrupt rate on the destination is much, much higher than on the source. Perhaps Windows is just being starved of any useful work by the interrupt rate? The source sees ~10/second (e1000e_irq_pending_interrupts trace events where the pending value is non-zero) whereas the destination sees hundreds. The interrupts on the destination seem to be 'other' interrupts:

```
23004:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
```

I've tested this with Sameeh's fix from bz 1449490 and it seems to fix it (Brew task 13224735), so marking this as a dupe of that.

*** This bug has been marked as a duplicate of bug 1449490 ***
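As an aside on methodology (not from the original report): the per-second rate of non-zero pending interrupts can be pulled out of a trace log and compared between source and destination. A rough sketch follows, assuming qemu was run with `-trace enable=e1000e_irq_pending_interrupts,file=e1000e.log`; the exact line format depends on the trace backend, so the timestamp regex below is an assumption and may need adjusting.

```python
# Rough sketch: count non-zero "PENDING" e1000e interrupt events per second
# in a qemu trace log.  The "pid@sec.usec" timestamp prefix is an assumption
# about the trace backend's output format.
import re
import sys
from collections import Counter

pend_re = re.compile(r"PENDING: (0x[0-9a-fA-F]+)")
ts_re = re.compile(r"@(\d+)\.\d+")

per_second = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        m = pend_re.search(line)
        if not m or int(m.group(1), 16) == 0:
            continue                            # only count pending != 0
        ts = ts_re.search(line)
        per_second[ts.group(1) if ts else "?"] += 1

for sec, count in sorted(per_second.items()):
    print(f"{sec}: {count} pending interrupts")
```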
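To see why `ICR PENDING: 0x1000000` reads as an 'other' interrupt, the values from the trace line can be decoded against the ICR bit definitions in QEMU's hw/net/e1000_regs.h; a small illustrative decoder (the table below is a subset of those bits, not exhaustive):

```python
# Decode the ICR/IMS values from the trace line above.
# Bit names are a subset of QEMU's hw/net/e1000_regs.h E1000_ICR_* defines.
ICR_BITS = {
    0x00000001: "TXDW",
    0x00000002: "TXQE",
    0x00000004: "LSC",
    0x00000040: "RXO",
    0x00000080: "RXT0",
    0x00100000: "RXQ0",
    0x00200000: "RXQ1",
    0x00400000: "TXQ0",
    0x00800000: "TXQ1",
    0x01000000: "OTHER",
    0x80000000: "INT_ASSERTED",
}

def decode(val):
    names = [name for bit, name in ICR_BITS.items() if val & bit]
    unknown = val & ~sum(ICR_BITS)              # bits not in the subset
    if unknown:
        names.append(hex(unknown))
    return "|".join(names) or "0"

for label, val in [("PENDING", 0x1000000), ("ICR", 0x815000c2), ("IMS", 0x1a00004)]:
    print(f"{label:>7} = {val:#010x} -> {decode(val)}")
```

Running it shows PENDING = ICR & IMS: only the OTHER cause is both raised in ICR and unmasked in IMS, which matches the observation that the destination's interrupt storm consists of 'other' interrupts.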