Bug 1447935

Summary: Windows guest hang after migration from rhel7.3.z->rhel7.4 with e1000e nic card
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.4
Reporter: huiqingding <huding>
Assignee: Dr. David Alan Gilbert <dgilbert>
QA Contact: huiqingding <huding>
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
CC: huding, jasowang, knoel, peterx, quintela, virt-maint
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2017-05-18 10:01:38 UTC
Bug Blocks: 1376765

Description huiqingding 2017-05-04 09:26:00 UTC
Description of problem:
A win8.1-32 guest hangs after migration from a rhel7.3.z host to a rhel7.4 host with an e1000e nic card: the mouse can still move, but the guest cannot be pinged from a third host.

A win2012r2 guest hangs for about 5-6 minutes, then works well.

Version-Release number of selected component (if applicable):
rhel7.3.z host:
kernel-3.10.0-514.18.1.el7.x86_64
qemu-img-rhev-2.6.0-28.el7_3.9.x86_64

rhel7.4 host:
kernel-3.10.0-663.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64 

How reproducible:
100%

Steps to Reproduce:
0. Sync clocks with an ntp server on the source and destination hosts:
# ntpdate clock.redhat.com

1. boot guest in rhel7.3.z host
# /usr/libexec/qemu-kvm \
-name rhel7 \
-machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off \
-m 4096 \
-cpu SandyBridge,check \
-realtime mlock=off \
-smp 4,maxcpus=4,sockets=4,cores=1,threads=1 \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-nodefaults \
-rtc base=utc,driftfix=slew \
-boot order=c,menu=on,strict=on \
-drive file=/mnt/stable_guest_abi/win2012-64r2-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
-netdev tap,id=hostnet2,vhost=on,script=/etc/qemu-ifup \
-device e1000e,netdev=hostnet2,id=virtio-net-pci2,mac=4e:63:28:bc:c1:75,bus=pci.0,addr=0x5,multifunction=off \
-monitor stdio \
-qmp tcp:0:4466,server,nowait -serial unix:/tmp/ttym,server,nowait \
-vga qxl \
-vnc :1

2. boot guest in rhel7.4 host with "-incoming tcp:0:5800"

3. Do migration from the source monitor, e.g.:
(qemu) migrate -d tcp:<destination host>:5800
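Step 3 can also be driven programmatically through the QMP socket opened on port 4466 in step 1. A minimal sketch, not part of the original reproducer; host names are placeholders, and note QMP requires `qmp_capabilities` before any other command:

```python
import json
import socket

def qmp_migrate_commands(dest_uri):
    """Build the QMP message sequence that starts a migration."""
    return [
        {"execute": "qmp_capabilities"},
        {"execute": "migrate", "arguments": {"uri": dest_uri}},
    ]

def send_qmp(host, port, commands):
    """Send each command to a QMP server and collect the replies."""
    with socket.create_connection((host, port)) as sock:
        f = sock.makefile("rw")
        f.readline()  # discard the server's greeting banner
        replies = []
        for cmd in commands:
            f.write(json.dumps(cmd) + "\n")
            f.flush()
            replies.append(json.loads(f.readline()))
        return replies

if __name__ == "__main__":
    # Against the setup above (placeholder host names):
    # send_qmp("source-host", 4466, qmp_migrate_commands("tcp:dest-host:5800"))
    print(qmp_migrate_commands("tcp:dest-host:5800"))
```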

Actual results:
After migration, the win8.1-32 guest hangs and cannot be pinged from a third host.
The win2012r2 guest hangs for about 5-6 minutes, during which a third host cannot ping it; the guest then recovers and can be pinged again.

Expected results:
After migration, the guest does not hang.

Additional info:
The issue is also hit with qemu-kvm-rhev-2.8.0-5.el7.x86_64 as the destination.

If the e1000e nic card is removed from the command line, the guest migrates without hanging and works well.

Comment 2 huiqingding 2017-05-04 09:32:54 UTC
Tested a win2016 guest; it does not hit this issue.

Comment 3 Dr. David Alan Gilbert 2017-05-04 16:06:05 UTC
Hi,
  Please can you test some things:
     a) Does this fail 7.4->7.4 but otherwise the same?
     b) Does this fail 7.3->7.3 but otherwise the same?
     c) On 7.3->7.4 if you use virtio-net or e1000 instead of e1000e does it fail?
     d) Please give details of your two hosts; which CPU etc

Dave

Comment 4 huiqingding 2017-05-09 02:55:58 UTC
>      a) Does this fail 7.4->7.4 but otherwise the same?
Test 7.4->7.4, after migration win8.1-32 guest works well and does not hang.

>      b) Does this fail 7.3->7.3 but otherwise the same?
Test 7.3.z->7.3.z, after migration win8.1-32 guest works well and does not hang.

>      c) On 7.3->7.4 if you use virtio-net or e1000 instead of e1000e does it
> fail?
Tested 7.3.z->7.4 with e1000 and with virtio-net: the issue is not hit with either.

>      d) Please give details of your two hosts; which CPU etc
7.3.z host:
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Stepping:              9
CPU MHz:               2357.156
BogoMIPS:              6785.04
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

7.4 host:
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Stepping:              9
CPU MHz:               3630.828
BogoMIPS:              6784.81
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

Comment 5 Dr. David Alan Gilbert 2017-05-09 15:37:51 UTC
I can certainly reproduce something weird happening, although it seems a bit more odd than a straight crash.
What I'm seeing is that it migrates OK and works for a while and then mostly stops - the mouse keeps moving but time stops incrementing and you can't do anything with the GUI.   If you poke it for a while it might wake up and do some more sometimes.

I'm using:

/usr/libexec/qemu-kvm -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -m 4096  -cpu SandyBridge,check -realtime mlock=off -smp 4,maxcpus=4,sockets=4,cores=1,threads=1 -nodefaults -uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 -rtc base=utc,driftfix=slew -boot order=c,menu=on,strict=on -drive file=/home/vms/huding-win8-32.1-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,id=un,vhost=on,fd=20 -device e1000e,netdev=un,id=virtio-net-pci2,mac=52:54:00:61:3f:43,bus=pci.0,addr=0x5,multifunction=off -monitor stdio -vga qxl -vnc :0 20<>/dev/tap15

Comment 6 Dr. David Alan Gilbert 2017-05-09 15:53:32 UTC
Replacing the e1000e with an e1000 makes it happy, so it does seem to be the same bug.

Comment 7 Dr. David Alan Gilbert 2017-05-10 18:01:22 UTC
I can reproduce the symptom I see going from 2.8->2.9 upstream with the commandline:

/opt/qemu/v2.9.0/bin/qemu-system-x86_64 -machine pc-i440fx-2.6,accel=kvm,usb=off -m 4096  -cpu SandyBridge,check -realtime mlock=off -smp 4,maxcpus=4,sockets=4,cores=1,threads=1 -nodefaults -uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 -rtc base=utc,driftfix=slew -boot order=c,menu=on,strict=on -drive file=/home/vms/huding-win8-32.1-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,id=un,vhost=on,fd=20 -device e1000e,netdev=un,id=virtio-net-pci2,mac=52:54:00:61:3f:43,bus=pci.0,addr=0x5,multifunction=off -monitor stdio -vga qxl -vnc :1 20<>/dev/tap15 -incoming tcp:0:4444

Comment 8 Dr. David Alan Gilbert 2017-05-11 10:27:55 UTC
A bit of bisecting.
I'm running an upstream 2.9.0 destination, migrating builds after 2.8.0 into it, and seeing what hangs.  I'm having to revert 07bfa354772 as per bz 1434784.

       413:v2.8.0-2168-g7ec7ae4b97 -> 414:2.9  ! Good ?
               + revert of 07bfa354772 as per bz 1434784
 
       413:v2.8.0-1677-g3a5eb5b + apic revert  ! Good ? Test repeated, still good

       413:v2.8.0-1423-g9eaaf97 + apic revert - broken
 
       413:v2.8.0-1502-g6959e45 + apic revert - broken
 
       413:v2.8.0-1598-g6b4e463 + apic revert - broken
 
       413:v2.8.0-1628-g6e86d90 + apic revert - broken
 
       413:v2.8.0-1648-g5db53e3 + apic revert - broken
 
       413:v2.8.0-1665-g3651c28 + apic revert - Good! Good again
       413:v2.8.0-1649-g43ddc18 + apic revert - broken
       413:v2.8.0-1664-g1bbe5dc + apic revert - initial response, but then hang? So bad?
                                              - try again, broken

So it looks like it's something between 1664 and 1665 - but there really doesn't seem to be anything relevant there.

Comment 9 Dr. David Alan Gilbert 2017-05-11 10:40:40 UTC
bz 1449490 is another possibly e1000e on windows migration bug - but this time 7.4->7.4

Comment 10 Dr. David Alan Gilbert 2017-05-11 11:49:26 UTC
The e1000e interrupt rate on the destination is much, much higher than on the source - perhaps Windows is simply being starved of useful work by the interrupt rate?

The source is seeing ~10/second (e1000e_irq_pending_interrupts trace events where the pending value is non-zero) whereas the destination is seeing hundreds.
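For reference, a rough way to compare those rates offline is to bucket the non-zero pending entries per second. This is a hypothetical helper, not part of the bug's tooling; it assumes trace lines in the shape emitted by the e1000e_irq_pending_interrupts trace point, and the assumption that the leading number is a millisecond timestamp depends on which trace backend produced the log:

```python
import re
from collections import Counter

# Matches lines such as:
#   23004:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
LINE = re.compile(
    r"(?P<ts>\d+):e1000e_irq_pending_interrupts ICR PENDING: (?P<pending>0x[0-9a-fA-F]+)"
)

def pending_per_second(lines, ts_divisor=1000):
    """Count trace entries with a non-zero pending mask, bucketed per second.

    ts_divisor converts the leading timestamp field to seconds; 1000 assumes
    milliseconds, adjust for whatever clock your trace backend emits.
    """
    buckets = Counter()
    for line in lines:
        m = LINE.search(line)
        if m and int(m.group("pending"), 16) != 0:
            buckets[int(m.group("ts")) // ts_divisor] += 1
    return dict(buckets)
```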

Comment 11 Dr. David Alan Gilbert 2017-05-11 16:54:21 UTC
The interrupts on the destination seem to be 'other' interrupts:
                                                                   31231507
                                                                    | | | |
23004:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
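The 'other' classification can be read straight off the pending mask: 0x1000000 is bit 24, which is the OTHER cause in QEMU's e1000e model. A small decoder sketch; the bit names below are my reading of QEMU's hw/net/e1000_regs.h and should be treated as approximate:

```python
# Names for a subset of e1000e Interrupt Cause Read register bits
# (approximate; taken from QEMU's hw/net/e1000_regs.h).
ICR_BITS = {
    0x00000004: "LSC",    # link status change
    0x00000080: "RXT0",   # receive timer
    0x00100000: "RXQ0",   # receive queue 0 (MSI-X)
    0x00200000: "RXQ1",
    0x00400000: "TXQ0",   # transmit queue 0 (MSI-X)
    0x00800000: "TXQ1",
    0x01000000: "OTHER",  # 'other' causes (MSI-X)
}

def decode_icr(mask):
    """Return the names of the set ICR bits, unknown bits as hex."""
    names = []
    for bit in range(32):
        value = 1 << bit
        if mask & value:
            names.append(ICR_BITS.get(value, hex(value)))
    return names

# The pending value from the trace above decodes to the 'other' cause:
print(decode_icr(0x1000000))   # ['OTHER']
```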

Comment 12 Dr. David Alan Gilbert 2017-05-18 10:01:38 UTC
I've tested this with Sameeh's fix from bz 1449490 and it seems to fix it;
(Brew task 13224735)

so marking it as a dupe of that

*** This bug has been marked as a duplicate of bug 1449490 ***