1447935 – Windows guest hang after migration from rhel7.3.z->rhel7.4 with e1000e nic card

Keywords:
Status:	CLOSED DUPLICATE of bug 1449490
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm-rhev
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Dr. David Alan Gilbert
QA Contact:	huiqingding
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1376765

Reported:	2017-05-04 09:26 UTC by huiqingding
Modified:	2017-05-18 10:01 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-05-18 10:01:38 UTC
Target Upstream Version:
Embargoed:

Description huiqingding 2017-05-04 09:26:00 UTC

Description of problem:
A win8.1-32 guest hangs after migration from rhel7.3.z to rhel7.4 with an e1000e NIC: the mouse still moves, but the guest cannot be pinged from a third host.

A win2012r2 guest hangs for about 5-6 minutes, then works normally.

Version-Release number of selected component (if applicable):
rhel7.3.z host:
kernel-3.10.0-514.18.1.el7.x86_64
qemu-img-rhev-2.6.0-28.el7_3.9.x86_64

rhel7.4 host:
kernel-3.10.0-663.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64 

How reproducible:
100%

Steps to Reproduce:
0. sync clock with ntp server on the source and destination hosts:
# ntpdate clock.redhat.com

1. boot guest in rhel7.3.z host
# /usr/libexec/qemu-kvm \
-name rhel7 \
-machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off \
-m 4096 \
-cpu SandyBridge,check \
-realtime mlock=off \
-smp 4,maxcpus=4,sockets=4,cores=1,threads=1 \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-nodefaults \
-rtc base=utc,driftfix=slew \
-boot order=c,menu=on,strict=on \
-drive file=/mnt/stable_guest_abi/win2012-64r2-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
-netdev tap,id=hostnet2,vhost=on,script=/etc/qemu-ifup \
-device e1000e,netdev=hostnet2,id=virtio-net-pci2,mac=4e:63:28:bc:c1:75,bus=pci.0,addr=0x5,multifunction=off \
-monitor stdio \
-qmp tcp:0:4466,server,nowait -serial unix:/tmp/ttym,server,nowait \
-vga qxl \
-vnc :1

2. boot guest in rhel7.4 host with "-incoming tcp:0:5800"

3. do migration
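
The report does not say how the migration in step 3 was started. One possibility, sketched here as an assumption, is to drive it through the QMP socket that the source command line exposes on port 4466, pointing at the `-incoming tcp:0:5800` listener on the destination. `qmp_capabilities` and `migrate` are standard QMP commands; the host names and the helper functions are hypothetical.

```python
import json
import socket


def build_migrate_commands(dest_host, dest_port=5800):
    """Return the QMP messages that start a migration, one JSON string each."""
    uri = "tcp:%s:%d" % (dest_host, dest_port)
    messages = [
        # QMP requires capability negotiation before any other command.
        {"execute": "qmp_capabilities"},
        {"execute": "migrate", "arguments": {"uri": uri}},
    ]
    return [json.dumps(m) for m in messages]


def start_migration(qmp_host, dest_host, qmp_port=4466, dest_port=5800):
    """Send the messages to the source QMP socket and return the raw replies.

    Simplification: one line is read per command, so an asynchronous QMP
    event arriving in between would be returned in place of a reply.
    """
    replies = []
    with socket.create_connection((qmp_host, qmp_port), timeout=10) as sock:
        stream = sock.makefile("rw")
        replies.append(stream.readline())  # QMP greeting banner
        for line in build_migrate_commands(dest_host, dest_port):
            stream.write(line + "\n")
            stream.flush()
            replies.append(stream.readline())
    return replies
```

For example, `start_migration("src-host", "dst-host")` would be run from a machine that can reach the source host's QMP port; the equivalent from the `-monitor stdio` prompt is the HMP `migrate` command with the same URI.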

Actual results:
After migration, the win8.1-32 guest hangs and the third host cannot ping it.
The win2012r2 guest hangs for about 5-6 minutes, during which the third host cannot ping it. After that the guest works well and the third host can ping it.

Expected results:
After migration, the guest does not hang.

Additional info:
Also tested qemu-kvm-rhev-2.8.0-5.el7.x86_64; it hits this issue too.

With the e1000e NIC removed from the command line, the guest migrates without hanging and works well.

Comment 2 huiqingding 2017-05-04 09:32:54 UTC

Tested a win2016 guest; it does not hit this issue.

Comment 3 Dr. David Alan Gilbert 2017-05-04 16:06:05 UTC

Hi,
  Please can you test some things:
     a) Does this fail 7.4->7.4 but otherwise the same?
     b) Does this fail 7.3->7.3 but otherwise the same?
     c) On 7.3->7.4 if you use virtio-net or e1000 instead of e1000e does it fail?
     d) Please give details of your two hosts; which CPU etc

Dave

Comment 4 huiqingding 2017-05-09 02:55:58 UTC

>      a) Does this fail 7.4->7.4 but otherwise the same?
Tested 7.4->7.4: after migration the win8.1-32 guest works well and does not hang.

>      b) Does this fail 7.3->7.3 but otherwise the same?
Tested 7.3.z->7.3.z: after migration the win8.1-32 guest works well and does not hang.

>      c) On 7.3->7.4 if you use virtio-net or e1000 instead of e1000e does it
> fail?
Tested 7.3.z->7.4 with e1000 and with virtio-net: neither hits this issue.

>      d) Please give details of your two hosts; which CPU etc
7.3.z host:
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Stepping:              9
CPU MHz:               2357.156
BogoMIPS:              6785.04
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

7.4 host:
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Stepping:              9
CPU MHz:               3630.828
BogoMIPS:              6784.81
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

Comment 5 Dr. David Alan Gilbert 2017-05-09 15:37:51 UTC

I can certainly reproduce something weird happening, although it seems a bit more odd than a straight crash.
What I'm seeing is that it migrates OK and works for a while and then mostly stops - the mouse keeps moving but time stops incrementing and you can't do anything with the GUI.   If you poke it for a while it might wake up and do some more sometimes.

I'm using:

/usr/libexec/qemu-kvm \
-machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off \
-m 4096 \
-cpu SandyBridge,check \
-realtime mlock=off \
-smp 4,maxcpus=4,sockets=4,cores=1,threads=1 \
-nodefaults \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-rtc base=utc,driftfix=slew \
-boot order=c,menu=on,strict=on \
-drive file=/home/vms/huding-win8-32.1-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
-netdev tap,id=un,vhost=on,fd=20 \
-device e1000e,netdev=un,id=virtio-net-pci2,mac=52:54:00:61:3f:43,bus=pci.0,addr=0x5,multifunction=off \
-monitor stdio \
-vga qxl \
-vnc :0 \
20<>/dev/tap15

Comment 6 Dr. David Alan Gilbert 2017-05-09 15:53:32 UTC

And after replacing the e1000e with an e1000 it seems happy, so this looks like the same bug.

Comment 7 Dr. David Alan Gilbert 2017-05-10 18:01:22 UTC

I can reproduce the symptom I see going from 2.8->2.9 upstream with the commandline:

/opt/qemu/v2.9.0/bin/qemu-system-x86_64 \
-machine pc-i440fx-2.6,accel=kvm,usb=off \
-m 4096 \
-cpu SandyBridge,check \
-realtime mlock=off \
-smp 4,maxcpus=4,sockets=4,cores=1,threads=1 \
-nodefaults \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-rtc base=utc,driftfix=slew \
-boot order=c,menu=on,strict=on \
-drive file=/home/vms/huding-win8-32.1-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
-netdev tap,id=un,vhost=on,fd=20 \
-device e1000e,netdev=un,id=virtio-net-pci2,mac=52:54:00:61:3f:43,bus=pci.0,addr=0x5,multifunction=off \
-monitor stdio \
-vga qxl \
-vnc :1 \
20<>/dev/tap15 \
-incoming tcp:0:4444

Comment 8 Dr. David Alan Gilbert 2017-05-11 10:27:55 UTC

A bit of bisecting.
I'm running an upstream 2.9.0 and migrating stuff after 2.8.0 to it and seeing what hangs.  I'm having to revert 07bfa354772 as per bz 1434784

       413:v2.8.0-2168-g7ec7ae4b97 -> 414:2.9  ! Good ?
               + revert of 07bfa354772 as per bz 1434784
 
       413:v2.8.0-1677-g3a5eb5b + apic revert  ! Good ? Test repeated, still good

       413:v2.8.0-1423-g9eaaf97 + apic revert - broken
 
       413:v2.8.0-1502-g6959e45 + apic revert - broken
 
       413:v2.8.0-1598-g6b4e463 + apic revert - broken
 
       413:v2.8.0-1628-g6e86d90 + apic revert - broken
 
       413:v2.8.0-1648-g5db53e3 + apic revert - broken
 
       413:v2.8.0-1665-g3651c28 + apic revert - Good! Good again
       413:v2.8.0-1649-g43ddc18 + apic revert - broken
       413:v2.8.0-1664-g1bbe5dc + apic revert - initial response, but then hang? So bad?
                                              - try again, broken

So it looks like it's something between 1664 and 1665 - but there really doesn't seem to be anything relevant there.
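
As a cross-check of that conclusion, the results above can be tabulated and the transition point computed. This is only a sketch: the numbers are the "commits after v2.8.0" counts from the `git describe` style names in the list, the `+ apic revert` detail is ignored, and `find_transition` is a hypothetical helper, not part of any bisect tooling.

```python
# (commits after v2.8.0, result) pairs transcribed from the list above.
RESULTS = [
    (2168, "good"), (1677, "good"), (1423, "broken"), (1502, "broken"),
    (1598, "broken"), (1628, "broken"), (1648, "broken"), (1665, "good"),
    (1649, "broken"), (1664, "broken"),
]


def find_transition(results):
    """Return (last_broken, first_good) for a broken-then-good history."""
    last_broken = max(n for n, r in results if r == "broken")
    first_good = min(n for n, r in results if r == "good")
    if last_broken >= first_good:
        # Good and broken results overlap, so there is no single boundary.
        raise ValueError("results are interleaved; no single transition point")
    return last_broken, first_good


print(find_transition(RESULTS))  # prints (1664, 1665)
```

Note that the history here runs from broken to good as the source version gets newer, which is the reverse of the usual bisect direction.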

Comment 9 Dr. David Alan Gilbert 2017-05-11 10:40:40 UTC

bz 1449490 is another, possibly related, e1000e-on-Windows migration bug - but this time 7.4->7.4

Comment 10 Dr. David Alan Gilbert 2017-05-11 11:49:26 UTC

The e1000e interrupt rate on the destination is much, much higher than on the source - perhaps Windows is just being starved of any useful work by the interrupt rate?

The source is seeing ~10/second (e1000e_irq_pending_interrupts trace where the pending value is non-zero) whereas the destination is seeing hundreds.
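
A rate like this can be estimated by counting the trace events whose pending value is non-zero over a known capture window. The sketch below assumes trace lines shaped like the `e1000e_irq_pending_interrupts ICR PENDING: 0x...` line quoted elsewhere in this bug, and assumes the capture duration is measured separately (for example with a stopwatch); both function names are hypothetical.

```python
import re

# Assumed line shape, modelled on the trace line quoted in this report.
TRACE_RE = re.compile(
    r"e1000e_irq_pending_interrupts ICR PENDING: (0x[0-9a-fA-F]+)"
)


def count_pending(lines):
    """Count trace events whose PENDING value is non-zero."""
    count = 0
    for line in lines:
        match = TRACE_RE.search(line)
        if match and int(match.group(1), 16) != 0:
            count += 1
    return count


def rate_per_second(lines, duration_seconds):
    """Average number of non-zero pending events per second of capture."""
    return count_pending(lines) / duration_seconds
```

Running the same capture window on the source and the destination gives two directly comparable numbers.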

Comment 11 Dr. David Alan Gilbert 2017-05-11 16:54:21 UTC

The interrupts on the destination seem to be 'other' interrupts:
                                                                   31231507
                                                                    | | | |
23004:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
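
To see why this reads as an 'other' interrupt, the register values from the trace line can be decoded bit by bit. The sketch below is an illustration, not code from QEMU: the bit names follow the `E1000_ICR_*` constants as recalled from the Linux e1000e driver headers and should be checked against the 82574 datasheet, the table is deliberately partial, and `decode` is a hypothetical helper.

```python
# Partial ICR/IMS bit table (assumption: names as in the Linux e1000e
# driver's E1000_ICR_* defines; verify against the datasheet before relying
# on it).
ICR_BITS = {
    0x00000001: "TXDW",
    0x00000002: "TXQE",
    0x00000004: "LSC",
    0x00000040: "RXO",
    0x00000080: "RXT0",
    0x00100000: "RXQ0",
    0x00200000: "RXQ1",
    0x00400000: "TXQ0",
    0x00800000: "TXQ1",
    0x01000000: "OTHER",
    0x80000000: "INT_ASSERTED",
}


def decode(value):
    """Return the names of the known bits set in value, lowest bit first."""
    names = []
    for bit in sorted(ICR_BITS):
        if value & bit:
            names.append(ICR_BITS[bit])
            value &= ~bit
    if value:
        # Report any bits the partial table does not cover.
        names.append("unknown:0x%x" % value)
    return names


icr = 0x815000C2
ims = 0x01A00004
pending = icr & ims  # causes that are both raised and unmasked

print(hex(pending), decode(pending))  # prints 0x1000000 ['OTHER']
```

With these values the only cause that is both raised in ICR and enabled in IMS is bit 24, which matches the `ICR PENDING: 0x1000000` field in the trace.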

Comment 12 Dr. David Alan Gilbert 2017-05-18 10:01:38 UTC

I've tested this with Sameeh's fix from bz 1449490 and it seems to fix it;
(Brew task 13224735)

so marking it as a dupe of that

*** This bug has been marked as a duplicate of bug 1449490 ***
