Bug 874406
| Summary: | windows server 2012 with e1000 stuck at 100% cpu during reboot | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Mike Cao <bcao> |
| Component: | qemu-kvm | Assignee: | Andrew Jones <drjones> |
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.4 | CC: | acathrow, areis, bcao, bsarathy, ddumas, dyasny, jasowang, juzhang, knoel, michen, mkenneth, mst, perfbz, qzhang, rhod, virt-maint, xfu, yunzheng, yvugenfi |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Microsoft Windows Server 2012 guests using the e1000 driver can become unresponsive, consuming 100% CPU, during reboot/boot. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-01-23 16:41:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Mike Cao
2012-11-08 06:06:51 UTC
CLI:
/usr/libexec/qemu-kvm -name ss4-1 -S -M rhel6.4.0 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -uuid eb28b9ee-894f-79c4-a64d-a883c435fd1f -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/ss4-1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/svvp/windows_server_2012_1,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,fd=22,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:48:34:80,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

Reproduced this issue even when booting only 3 win2012-64 VMs (1 GB memory and 1 vCPU each) on a host with 8 GB memory and 4 CPUs.
1) Guests get stuck when multiple guests are rebooted at the same time.
2) Re-tested win2k8-r2 guests with the same command line; the issue did not reproduce in 5 attempts.

CLI:
# /usr/libexec/qemu-kvm -name ss4-1 -S -M rhel6.4.0 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -uuid eb28b9ee-894f-79c4-a64d-a883c435fd1f -nodefconfig -nodefaults -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/win2012-64.raw,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:48:34:80,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc :0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -monitor stdio

(In reply to comment #3)
> Reproduced this issue even when booting only 3 win2012-64 VMs (1 GB memory
> and 1 vCPU each) on a host with 8 GB memory and 4 CPUs.
> 1) Guests get stuck when multiple guests are rebooted at the same time.
> 2) Re-tested win2k8-r2 guests with the same command line; the issue did not
> reproduce in 5 attempts.
>
> CLI:
> # /usr/libexec/qemu-kvm -name ss4-1 -S -M rhel6.4.0 -enable-kvm -m 1024 -smp
> 1,sockets=1,cores=1,threads=1 -uuid eb28b9ee-894f-79c4-a64d-a883c435fd1f
> -nodefconfig -nodefaults -rtc base=utc -no-shutdown -device
> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
> file=/home/win2012-64.raw,if=none,id=drive-ide0-0-0,format=raw,cache=none
> -device
> ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> -netdev tap,id=hostnet0 -device
> e1000,netdev=hostnet0,id=net0,mac=52:54:00:48:34:80,bus=pci.0,addr=0x3
> -chardev pty,id=charserial0 -device
> isa-serial,chardev=charserial0,id=serial0 -vnc :0 -vga cirrus -device
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -monitor stdio

Host packages:
kernel-2.6.32-339.el6.x86_64
qemu-kvm-0.12.1.2-2.334.el6.x86_64

Top output while the guests are stuck:

  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+  COMMAND
32420 root  20  0 1355m 1.0g 4536 S 100.1 13.4 3:22.07 qemu-kvm
32405 root  20  0 1355m 1.0g 4536 S  99.7 13.4 3:37.39 qemu-kvm
32436 root  20  0 1355m 1.0g 4536 S  97.7 13.4 2:41.26 qemu-kvm

Can you please provide a perf trace?

Created attachment 646184 [details]
ftrace log for this bug.
Created attachment 646186 [details]
The log is a little large, so a tar-compressed file is uploaded instead.
(In reply to comment #3)
> Reproduced this issue even when booting only 3 win2012-64 VMs (1 GB memory
> and 1 vCPU each) on a host with 8 GB memory and 4 CPUs.
> 1) Guests get stuck when multiple guests are rebooted at the same time.

So the boot hang only occurs when more than one win2012-64 VM is booted simultaneously, i.e. when booted individually they run fine? Does the problem also reproduce with a single VM that has more than one vCPU? And does it still reproduce with multiple VMs even when each VM's vCPUs are pinned to unique pCPUs?

> 2) Re-tested win2k8-r2 guests with the same command line; the issue did not
> reproduce in 5 attempts.

So we've only seen this boot problem with one guest type (win2012-64) so far. Is this a regression, or has win2012-64 never been tested before?

Can somebody give me access to a system with these Windows guests ready for testing, or point me to where I can install them from? If so, I can try the above experiments and whatever else comes to mind myself.

Please grab /proc/cpuinfo from the machine this reproduced on.

(In reply to comment #13)
> Please grab /proc/cpuinfo from the machine this reproduced on.

Tried both AMD and Intel hosts; both reproduce this issue. E.g.:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping : 7
cpu MHz : 1600.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 6185.40
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9600B Quad-Core Processor
stepping : 3
cpu MHz : 2293.756
cache size : 512 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips : 4587.51
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

I am sure qzhang tested it on a different host from mine, but I am not sure of the configuration of her host.

My host CPU configuration seems to be the same as bcao's Intel host.
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping : 7
cpu MHz : 1600.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 6185.62
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

From Gleb: So QE should reproduce, kill all other VMs, and run ftrace. Looking at a Windows dump may provide us with a clue.

(In reply to comment #16)
> From Gleb: So QE should reproduce, kill all other VMs, and run ftrace.
> Looking at a Windows dump may provide us with a clue.

Retested again, collecting the trace log after killing the other VMs, so that only 1 VM is running and it hangs.

Steps:
1. Boot 5 VMs (each with 1 GB memory and 1 vCPU) on a host with 8 GB memory and 4 CPUs. The command line is the same as comment 3 (of course the uuid, MAC address, and VNC port need to be replaced for each VM).
2. (qemu) cont (do this at nearly the same time for the 5 VMs).
3. Watch the guests; 3 of the 5 guests hang and cannot boot up successfully.
4. Kill the other 4 VMs, leaving only 1 VM running (a hung guest).
5. On the host:
   # mount -t debugfs nodev /sys/kernel/debug
   # cd /sys/kernel/debug/tracing
   # echo function > current_tracer
   # echo 1 > tracing_enabled
   # cp trace /tmp/trace-qzhang.txt

Created attachment 650223 [details]
ftrace log (kill other vm and only leave 1)
Reproduced with the original environment and saw in kvm_stat that tons of interrupts were being injected. Gleb suggested using ftrace to figure out which interrupt, and then the qemu monitor to figure out which device that interrupt is mapped to.

ftrace showed thousands of kvm_set_irq calls for gsi 11.

The qemu monitor showed

  Bus 0, device 3, function 0:
    Ethernet controller: PCI device 8086:100e
      IRQ 11.
      BAR0: 32 bit memory at 0xf2020000 [0xf203ffff].
      BAR1: I/O at 0xc040 [0xc07f].
      id "net0"

So it's the nic, which I neglected to look closely enough at before, and now I see that it's the e1000 model for the failing configs. Switching my VM configs on my machine to it allows me to easily reproduce the issue there too. Removing the <model type='e1000'/> again fixes it again. So the win2012 e1000 driver or the e1000 emulation seems to be the culprit. Considering win2k8-r2 guests work on the same host with the same command line, I guess it's most likely the 2012 driver.

I was thinking the next step should be to get MS involved (and maybe it still is), but I guess there's a chance that the driver is behaving correctly and is just trying to use more of the e1000 spec than is currently emulated (and more than is used by win2k8). I'll look at the traces some more to try to see if that's the case. I still wouldn't mind looking at the Windows dmp either, if I can get access to it.

To verify that this is not the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=869095 (which we decided not to fix) I created a new build fixing the above issue:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5132461
Can you please try with the above build?

(In reply to comment #24)
> Can you please try with the above build?

I sort of already tried it. I built the latest qemu-kvm (which has the revert of the revert for the interrupt disable bit emulation) and tried that, but the issue still reproduced.

A couple of new data points:

1) Yesterday the bug stopped reproducing for me while I was messing around with it (turning on e1000 debug output, etc.). After putting everything back the way it was before, it still didn't reproduce. I tried many times. Finally, I just tried booting the VMs on the realtek driver, then shutting them down, switching the config to the e1000 and booting again. It reproduced on the first try.

2) mst asked me to check for pending interrupts on a VM exhibiting the 100% CPU load. I just checked now and don't see any. I have all zeros for 'info pic'.

Update:

Re-tested with the following package versions; all of them reproduce the issue.

1) RHEL 6.3 released kernel and qemu-kvm:
kernel-2.6.32-279.el6.x86_64
qemu-kvm-0.12.1.2-2.295.el6.x86_64

2) mst's build in comment 24:
qemu-kvm-0.12.1.2-2.337.el6.mstfc5c7506b65b9c0beecc1cdc6c1c317185615c90.x86_64

3) Jason's build in comment 27:
qemu-kvm-0.12.1.2-2.337.el6bz874406.x86_64

CLI:
/usr/libexec/qemu-kvm -name ss4-1 -S -M rhel6.4.0 -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -uuid eb28b9ee-894f-79c4-a64d-a883c435fd1f -nodefconfig -nodefaults -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/win2012-64-new.qcow2,if=none,id=drive-ide0-0-0,format=qcow2,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:48:34:80,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc :0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -monitor stdio

(In reply to comment #19)
> Reproduced with the original environment and saw in kvm_stat that tons of
> interrupts were being injected. Gleb suggested using ftrace to figure out
> which interrupt, and then the qemu monitor to figure out which device that
> interrupt is mapped to.
>
> ftrace showed thousands of kvm_set_irq calls for gsi 11.
>
> The qemu monitor showed
>
>   Bus 0, device 3, function 0:
>     Ethernet controller: PCI device 8086:100e
>       IRQ 11.
>       BAR0: 32 bit memory at 0xf2020000 [0xf203ffff].
>       BAR1: I/O at 0xc040 [0xc07f].
>       id "net0"
>
> So it's the nic, which I neglected to look closely enough at before, and
> now I see that it's the e1000 model for the failing configs. Switching my
> VM configs on my machine to it allows me to easily reproduce the issue
> there too. Removing the <model type='e1000'/> again fixes it again. So the
> win2012 e1000 driver or the e1000 emulation seems to be the culprit.
> Considering win2k8-r2 guests work on the same host with the same command
> line, I guess it's most likely the 2012 driver.

Andrew, did you see other devices sharing the same IRQ? On Windows, virtio-serial and virtio-balloon use legacy interrupts, not MSI.

(In reply to comment #32)
> Andrew, did you see other devices sharing the same IRQ? On Windows,
> virtio-serial and virtio-balloon use legacy interrupts, not MSI.

I just looked again and see that all my VMs now use IRQ 10 for the nic. I double-checked kvm_stat and trace-cmd and see that it's still the source of the issue, i.e. thousands of injected interrupts to gsi 10 instead of 11 now. So it's the same problem, just a different IRQ. And no other devices are sharing IRQ 10.

While poking around in the trace I looked closer at the mmio activity, which kvm_stat also reports as super high. I see almost exactly the same number (many thousands) of the message below as I see interrupt injections to the nic.

mmio unsatisfied-read len 4 gpa 0xf20200c0 val 0x0

I turned more debug output on in the e1000 model

DBGBIT(INTERRUPT) | DBGBIT(MMIO) | DBGBIT(UNKNOWN) | DBGBIT(GENERAL)

but I don't see anything interesting leading up to the interrupt storm.

(In reply to comment #33)
> mmio unsatisfied-read len 4 gpa 0xf20200c0 val 0x0

The qemu logs say the base address is 0xf2020000, and the e1000 manual says offset 0xc0 is the interrupt cause read register (ICR).

I tried looking at the dump, but it was too slow to do anything over the VPN with windbg running in a Windows VM. I did see it was in E1G6032E!E1000ISR+101 at the time, which isn't a big surprise.

Note that comment 25 says the issue reproduced on upstream, so cherry-picking upstream commits should not help.
Reproduces on a guest using the e1000 model even when all other VMs are using the realtek model. I still haven't seen it without [re]booting multiple VMs simultaneously, though. So, maybe a scheduling-sensitive issue during the e1000 init? The next thing to try is to just create a few network-busy VMs and then see if the e1000 VM still hits the issue on boot.

I would see if anything appears by enabling debug in the code; it looks like debugflags = 0xffff should do it. Also maybe add a printf in e1000_mmio_writel and readl and see what the guest does.

Created attachment 659288 [details]
e1000 logs, one normal set and one set with the problem
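For reference, the kind of printf instrumentation suggested above might look roughly like this in hw/e1000.c (a sketch only, with added lines marked '+'; the function signatures shown are the upstream ones of that era and are assumed, not verified, to match the RHEL 6 qemu-kvm tree):

 static void
 e1000_mmio_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
 {
+    /* debug: dump every 32-bit register write the guest makes */
+    fprintf(stderr, "e1000 mmio writel addr=0x%08lx val=0x%08x\n",
+            (unsigned long)addr, val);
     ...

 static uint32_t
 e1000_mmio_readl(void *opaque, target_phys_addr_t addr)
 {
+    /* debug: dump every 32-bit register read the guest makes */
+    fprintf(stderr, "e1000 mmio readl addr=0x%08lx\n", (unsigned long)addr);
     ...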
(In reply to comment #48)
> I would see if anything appears by enabling debug in the code;
> it looks like debugflags = 0xffff should do it.

Yeah, I probably should have just started with 0xffff rather than trying to be selective with the debug output (comment 33). I've now attached logs, one for a normal boot and one for the same VM when it hits the problem. Looking through them, it appears that most of the differences are simply the ordering of events rather than different events. What stands out, though, is that in the failing case the device appears to initialize twice (the eeprom gets read twice).

I've been able to reproduce this without other VMs involved. I just booted a single win2012 VM that had a single vCPU, went into device manager, and did a series of e1000 driver uninstalls followed by scans for hardware changes - which auto-installs the driver back again. Once I was able to reproduce the issue in only two rounds. I captured all logs in all cases, but I still don't see anything too suspicious in them.

Looking at it - I kind of expected to see the guest doing something in a loop again and again, but this does not appear to be the case?

(In reply to comment #52)
> Looking at it - I kind of expected to see the guest doing something in a
> loop again and again, but this does not appear to be the case?

Nope, doesn't appear to. The only "loop" is the repeated sequence of interrupt-related logs. I decided to go back to analysing the ftrace logs instead, but unfortunately haven't found anything suspicious in those either.

I was able to reproduce the 100% vcpu issue while uninstalling the driver with device manager now as well. So I guess the issue isn't on the init path.

> The only "loop" is the repeated sequence of interrupt-related logs.
but it does not repeat forever right?
(In reply to comment #54)
> > The only "loop" is the repeated sequence of interrupt-related logs.
> but it does not repeat forever right?

It does. The log just fills with it.

Ah, one other note: disabling/enabling the nic over and over didn't reproduce the issue (at least not in the 30 or so times I tried). So if the bug appears on the uninstall path, then maybe it's occurring after the nic is already disabled? I'll try to test some more to confirm.

Got it to reproduce with a powershell script doing disable/enable

while (1) {
    netsh interface set interface "Ethernet 2" disabled
    netsh interface set interface "Ethernet 2" enabled
    sleep 1
}

It reproduced fairly quickly once, but then another time it ran until I stopped waiting and killed it. A more reliable way to reproduce is by removing the device and then rescanning for it with devcon. I wrote a script that does that, and it also synchronizes the starting/stopping of tracing - in order to try to reduce the traces we have to look at. I'll attach the traces and scripts now. The traces are much smaller, but I still don't see anything glaring in them. I haven't looked much at them yet, though.

Created attachment 662579 [details]
reproducer script with trace sync
Created attachment 662580 [details]
traces (ftrace and qemu log) made with trace sync
Created attachment 662581 [details]
setupapi.dev.log for the same failure as traces.tgz
Some clues:

The qemu logs show that the driver attempts to generate a TXQE interrupt after it has set the interrupt mask to 0. This should be OK; the TXQE bit just stays set in the ICR until an allowed interrupt type occurs, then the ICR is read, clearing the whole thing. So it's not clear yet how this might cause a problem, but the one-line hack below appears to stop the interrupt storm.

diff --git a/hw/e1000.c b/hw/e1000.c
index ec3a7c4..70b8b57 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -237,6 +237,7 @@ set_ics(E1000State *s, int index, uint32_t val)
 {
     DBGOUT(INTERRUPT, "set_ics %x, ICR %x, IMR %x\n", val, s->mac_reg[ICR],
         s->mac_reg[IMS]);
+    val &= s->mac_reg[IMS];
     set_interrupt_cause(s, 0, val | s->mac_reg[ICR]);
 }

The other clue we still have is that the interrupt storm seems to only reproduce on device enable/disable or uninstall/rescan. This seems consistent with the known issue where interrupts can arrive before the handler is registered, keeping the cpu too busy to ever finish registering the handler. That issue should be addressed for the device install case by Jason Wang's "b9d03e35: e1000: link auto-negotiation emulation"; however, it may still exist for the other three cases.

Actually, in the end the TXQE type interrupt was a red herring, but one that seems to have led me in the right direction. I've changed the hack to be more specific and have now narrowed it down to a set_ics that comes from the driver (i.e. comes through macreg_writeops, not from a model function), and to only change val if IMS == 0 (it sets val = 0). This still "fixes" things.

Watching the trace I see that all set_ics calls from the driver are for interrupt type 0x80, which is RXT0 (rx timer intr). During controller init we usually only get these ICS writes with IMS non-zero, so the hack allows them, and we still don't break. However, during normal operation (e.g. running iexplorer) we get tons more of the ICS writes, and IMS is never non-zero, so the hack "disallows" them all - probably unnecessarily, as the problem seems to only manifest itself during enable/disable. Thus, it's still a hack.

So it seems the driver has an RX timer that wants to periodically generate interrupts, but it has all controller interrupt types masked at the time, likely on purpose, wanting the interrupt to come later for the normal interrupt handler to manage. However, on enable/disable that interrupt may come before/after the handler is ready. I wonder what I'd see looking for the same thing in a driver for e.g. win2k8?

The bug easily reproduces with win2k8, using whatever version of the e1000 driver comes with it. I simply booted a win2k8r2x64 server and did a few enable/disables with netsh, and then bam, interrupt storm. The trace of the RXT0 set_ics calls matches win2012.

Next I watched the trace for a Linux guest (3.3.4-5.fc17.x86_64). There were no RXT0 set_ics calls. In place of those is a constant flow of RXDMT0 (rx desc min. threshold) set_ics calls [1]. All of them are made with a non-zero IMS, which means the hack would always allow them, i.e. it does nothing, so there's no need for the hack.

I think the next thing to do is to get Intel and MS involved in order to see whether my hypothesis of a race between the receive timer and the [un]registering of the E1000 interrupt handler holds water. If so, then they should really fix that. I don't know of any clean way to hack the model for this. The hack I have, which I'll attach for completeness now, could possibly break other drivers - although it doesn't look like it would break Linux.
[1] Footnote: Maybe the Linux driver should bump up the RCTL.RDMTS value it uses, to reduce the frequency at which RXDMT0 gets generated?

Created attachment 666693 [details]
hack to deal with RXT0 set_ics calls that have IMS=0
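The attachment body is not inlined in this report; going by the description in comment 65, the narrowed-down hack presumably amounts to something like the following (a reconstruction for illustration only - it assumes guest-initiated ICS writes reach set_ics through macreg_writeops with a non-zero index argument, while the model's internal callers pass 0; the real attachment may distinguish the two differently):

 static void
 set_ics(E1000State *s, int index, uint32_t val)
 {
     DBGOUT(INTERRUPT, "set_ics %x, ICR %x, IMR %x\n", val, s->mac_reg[ICR],
         s->mac_reg[IMS]);
+    /* hack: if the ICS write came from the driver while every interrupt
+     * cause is masked (IMS == 0), drop the new cause bits instead of
+     * latching them into ICR */
+    if (index != 0 && s->mac_reg[IMS] == 0) {
+        val = 0;
+    }
     set_interrupt_cause(s, 0, val | s->mac_reg[ICR]);
 }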
To comment 65: I agree, let's get Intel involved. Maybe a cleaner workaround is to ignore (mask) RXT0 interrupts, even if the bit in IMS is set, as long as the receive descriptor ring has not yet been set up. Let's try?

Created attachment 668921 [details]
this is a simple hack: after reset, disable rxt0 interrupts until RDT is set
Basically, real hardware doesn't send receive interrupts until RDH/RDT are programmed, so maybe this also applies to the receive timer interrupt, even if it was set by the driver?
Strange, but let's see:
this hack is untested and needs some polish anyway,
but I wonder whether it changes anything.
Could you try?
I think we should talk to Intel anyway.
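The attached hack is likewise not inlined in the report; the idea described above would look roughly like the fragments below (a sketch only: the flag name rxt0_allowed is invented for illustration, the functions to hook may differ in the real patch, and E1000_ICS_RXT0 is assumed to be the defined constant for the 0x80 receive-timer cause bit):

/* in E1000State: a flag that is cleared on device reset */
uint32_t rxt0_allowed;

/* on device reset: */
s->rxt0_allowed = 0;

/* wherever the guest's write to RDT (the receive ring tail) is handled: */
s->rxt0_allowed = 1;

/* in set_ics() / set_interrupt_cause(): suppress the receive timer cause
 * until the receive ring has been programmed */
if (!s->rxt0_allowed) {
    val &= ~E1000_ICS_RXT0;
}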
(In reply to comment #58)
> Got it to reproduce with a powershell script doing disable/enable
>
> while (1) {
>     netsh interface set interface "Ethernet 2" disabled
>     netsh interface set interface "Ethernet 2" enabled
>     sleep 1
> }
>
> It reproduced fairly quickly once, but then another time it ran until I
> stopped waiting and killed it. A more reliable way to reproduce is by
> removing the device and then rescanning for it with devcon. I wrote a
> script that does that, and it also synchronizes the starting/stopping of
> tracing - in order to try to reduce the traces we have to look at. I'll
> attach the traces and scripts now. The traces are much smaller, but I
> still don't see anything glaring in them. I haven't looked much at them
> yet, though.

I ran an additional script twice during the enable/disable script, just to keep the CPU busy. 100% reproduction rate in less than a minute using one VM on the host.

while (1) {
    print "something"
}

(In reply to comment #62)
> diff --git a/hw/e1000.c b/hw/e1000.c
> index ec3a7c4..70b8b57 100644
> --- a/hw/e1000.c
> +++ b/hw/e1000.c
> @@ -237,6 +237,7 @@ set_ics(E1000State *s, int index, uint32_t val)
>  {
>      DBGOUT(INTERRUPT, "set_ics %x, ICR %x, IMR %x\n", val, s->mac_reg[ICR],
>          s->mac_reg[IMS]);
> +    val &= s->mac_reg[IMS];
>      set_interrupt_cause(s, 0, val | s->mac_reg[ICR]);
>  }

I end up with a blue screen (an exception in the e1000 driver) using this patch after 5-10 minutes of disable/enable cycles.

Yan, what about the one from comment 70?

(In reply to comment #73)
> Yan, what about the one from comment 70?

Tried that, and a variation checking "(s->mac_reg[RCTL] & E1000_RCTL_EN)" as well, and it didn't help.

Created attachment 675644 [details]
Possible fix
I started writing a note to Intel. While summarizing the situation it occurred to me that it doesn't make much sense to save writes to ICS (=> ICR) that are masked by IMS. The manual doesn't spell out whether to do so or not, though. The attached patch drops those writes and also appears to fix the issue (enable/disable loop running now). Assuming real hardware works this way as well, which I'll try to test, this patch would be a real fix rather than another hack.
(In reply to comment #78)
> Created attachment 675644 [details]
> Possible fix
>
> I started writing a note to Intel. While summarizing the situation it
> occurred to me that it doesn't make much sense to save writes to ICS (=>
> ICR) that are masked by IMS. The manual doesn't spell out whether to do so
> or not, though. The attached patch drops those writes and also appears to
> fix the issue (enable/disable loop running now). Assuming real hardware
> works this way as well, which I'll try to test, this patch would be a real
> fix rather than another hack.

The TXQE that was set from QEMU just before the interrupt storm, although there was no mask for it, looked very suspicious. In any case, on my setup the guest with this patch stays alive considerably longer than the time that was needed to reproduce the hang. Also, there is no BSOD side effect from the e1000 driver after a while, as there was with the patch from comment #62.

I left my enable/disable loop running all night and Windows eventually crashed.

bugcheck 7e - system thread exception not handled

stack:
E1G6032E!AllowSmbusAccess+0x10
E1G6032E!PeriodicWorkItem+0x80
nt!IopProcessWorkItem+0x5f
nt!ExpWorkerThread+0x142
nt!PspSystemThreadStartup+0x59
nt!KiStartSystemThread+0x16

FAULTING_IP:
E1G6032E!AllowSmbusAccess+10
mov ecx,dword ptr [rax+5820h]

rax is null; it comes from offset 0 of rcx, which appears to be some structure handed to E1G6032E!PeriodicWorkItem. The byte at offset d70 of this structure is used to determine whether the pointer at offset 0 should be dereferenced (see the disassembly below). If it's non-zero, as in this case (rcx+d70 == 1), the code attempts to use rax and crashes.

E1G6032E!AllowSmbusAccess:
fffff880`04420d04 32d2            xor dl,dl
fffff880`04420d06 3891700d0000    cmp byte ptr [rcx+0D70h],dl
fffff880`04420d0c 7503            jne E1G6032E!AllowSmbusAccess+0xd (fffff880`04420d11)
E1G6032E!AllowSmbusAccess+0xa:
fffff880`04420d0e 32c0            xor al,al
fffff880`04420d10 c3              ret
E1G6032E!AllowSmbusAccess+0xd:
fffff880`04420d11 488b01          mov rax,qword ptr [rcx]
fffff880`04420d14 8b8820580000    mov ecx,dword ptr [rax+5820h]

There's some circumstantial evidence connecting interrupt (not-)enabling to this as well. The disassembly of E1G6032E!PeriodicWorkItem shows that just above the call to E1G6032E!AllowSmbusAccess there's a potential call to E1G6032E!E1000EnableInterrupt. However, there are several conditions that must be met, otherwise that call is skipped. In this case it was skipped. I tried to further reverse engineer this driver, but without a good disassembler it's driving me crazy. In any case, my fix in comment 78 doesn't appear to be the fix.

(In reply to comment #78)
> Created attachment 675644 [details]
> Possible fix
>
> I started writing a note to Intel. While summarizing the situation it
> occurred to me that it doesn't make much sense to save writes to ICS (=>
> ICR) that are masked by IMS. The manual doesn't spell out whether to do so
> or not, though. The attached patch drops those writes and also appears to
> fix the issue (enable/disable loop running now). Assuming real hardware
> works this way as well, which I'll try to test, this patch would be a real
> fix rather than another hack.

Hi Andrew,

Could you make a scratch build for QE to test? I don't understand the relationship between the guest getting stuck at 100% CPU during reboot and the disable/enable-in-a-loop issue. Do you mean they are exactly the same issue?

Thanks,
Mike

(In reply to comment #85)
> Could you make a scratch build for QE to test?
See comment 83; I'm not happy with the fix from comment 78 any more.

> I don't understand the relationship between the guest getting stuck at 100%
> CPU during reboot and the disable/enable-in-a-loop issue. Do you mean they
> are exactly the same issue?

Yes, it's the same issue. I was able to reduce the reproducer to something much simpler and faster.

(In reply to comment #86)
> (In reply to comment #85)
> > Could you make a scratch build for QE to test?
>
> See comment 83; I'm not happy with the fix from comment 78 any more.
>
> > I don't understand the relationship between the guest getting stuck at
> > 100% CPU during reboot and the disable/enable-in-a-loop issue. Do you
> > mean they are exactly the same issue?
>
> Yes, it's the same issue. I was able to reduce the reproducer to something
> much simpler and faster.

Could you provide me a scratch build first? I still want to check whether the guest will hang during reboot or not.

Thanks,
Mike

FYI, I can reproduce this issue by:
1) Flood pinging the guest from the local host
2) Disabling the driver in the guest

It looks like the bug can be reproduced just with that. Maybe we can ask for help from qemu upstream?

(In reply to comment #87)
> Could you provide me a scratch build first? I still want to check whether
> the guest will hang during reboot or not.

It wouldn't be worth it. The reboot test would likely be fine, and maybe we'd even pass cert, but we know there's still an issue that we haven't completely understood yet. So the patch is still in progress.

mst pointed out a pattern he saw in the qemu traces, indicating the guest is just trying to preserve the RXT0 bits it sees in ICR by setting them in ICS. I attempted to confirm/deny that yesterday, and think I denied it. I see ICS writes initiated by the guest (i.e. initiated through mmio) for RXT0 at different times, whether ICR has the bit set or not. I'm still improving the tracing output to try to better understand what happens, though.

I like Jason's suggestion about getting a thread started upstream. We can try to get some Intel people on CC as well. I'll send a mail soon.

*** Bug 956702 has been marked as a duplicate of this bug. ***

*** Bug 990053 has been marked as a duplicate of this bug. ***

This bug may not be e1000 specific. See bug 999296. And then again, it may... we don't know the root cause of either bug yet.