Bug 607510 - Windows7 guest cannot resume after suspended to disk after plenty of pause:resume iterations - e1000
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: beta
Target Release: 6.2
Assigned To: jason wang
QA Contact: Virtualization Bugs
Duplicates: 716804
Depends On:
Blocks: 580953 720669 753024 761491 847241 884998
Reported: 2010-06-24 05:47 EDT by Cao, Chen
Modified: 2013-02-21 02:29 EST (History)
Fixed In Version: qemu-kvm-0.12.1.2-2.306.el6
Doc Type: Bug Fix
Clones: 847241 884998
Last Closed: 2013-02-21 02:29:58 EST

Attachments
stuck on the intel host (279.60 KB, image/png)
2011-01-04 20:50 EST, Cao, Chen
stuck on the amd host (314.29 KB, image/png)
2011-01-04 20:50 EST, Cao, Chen
ftrace for kvm when Windows7 guest stuck while resuming from S4 (4.69 MB, text/plain)
2011-01-05 20:59 EST, Cao, Chen


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:0527 normal SHIPPED_LIVE qemu-kvm bug fix and enhancement update 2013-02-20 16:51:08 EST

Description Cao, Chen 2010-06-24 05:47:31 EDT
Description of problem:
The guest hangs while resuming from suspend-to-disk.


Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.80.el6.x86_64

How reproducible:
90%


Steps to Reproduce:
1. Start the VM with the command:
qemu-kvm -name 'vm8' -monitor tcp:0:6001,server,nowait -drive file=./win7-32.qcow2,if=ide,cache=none,aio=native -net nic,vlan=0,model=rtl8139,macaddr=02:bE:aF:7B:7b:6c -net tap,vlan=0,ifname=rtl8139_0_6001,script=/root/qemu-ifup-switch,downscript=no,vhost=on -m 2048 -smp 1  -soundhw ac97 -redir tcp:5000::22 -vnc :0 -spice port=8000,disable-ticketing -usbdevice tablet -rtc-td-hack -cpu qemu64,+sse2 -no-kvm-pit-reinjection -serial unix:/tmp/serial-20100623-102949-ud1H,server,nowait

2. Suspend the guest to disk with the command:
rundll32.exe PowrProf.dll, SetSuspendState

3. Once the guest has suspended, start the VM again with the same command as in step 1.
  

Actual results:
The guest becomes unresponsive while resuming.


Expected results:
The guest resumes successfully, with all state the same as before suspending.


Additional info:
1.
# uname -r 
2.6.32-37.el6.x86_64

# rpm -qa |grep qemu
qemu-img-0.12.1.2-2.80.el6.x86_64
qemu-kvm-0.12.1.2-2.80.el6.x86_64
qemu-kvm-tools-0.12.1.2-2.80.el6.x86_64
gpxe-roms-qemu-0.9.7-6.3.el6.noarch
qemu-kvm-debuginfo-0.12.1.2-2.75.el6.x86_64

# cat /proc/cpuinfo
processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
stepping	: 10
cpu MHz		: 2660.016
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips	: 5319.74
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

2.
I have tried on RHEL 5 with the same image; suspend/resume works there.

# uname -r
2.6.18-194.3.1.el5

# rpm -qa |grep kvm
kvm-qemu-img-83-164.el5_5.9
etherboot-zroms-kvm-5.4.4-13.el5
etherboot-roms-kvm-5.4.4-13.el5


# cat /proc/cpuinfo
processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
stepping	: 10
cpu MHz		: 2667.000
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips	: 5320.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:
Comment 2 RHEL Product and Program Management 2010-06-24 06:12:55 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 3 Gleb Natapov 2010-06-29 09:34:38 EDT
Is it reproducible without spice?
Comment 4 Cao, Chen 2010-06-29 22:55:21 EDT
(In reply to comment #3)
> Is it reproducible without spice?

I can only reproduce this bug without spice; the Windows7 guest can suspend/resume when started with the -spice option.

I have also tried it on:
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.90.el6.x86_64

# uname -r
2.6.32-39.el6.x86_64
Comment 5 RHEL Product and Program Management 2010-07-15 10:04:06 EDT
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 7 Gleb Natapov 2010-11-04 14:45:52 EDT
Is this still reproducible?
Comment 8 Cao, Chen 2010-11-29 05:05:41 EST
(In reply to comment #7)
> Is this still reproducible?

Yes, it still reproduces, about once in 6 attempts.

With the command:
/usr/libexec/qemu-kvm -name 'vm1' \
-chardev socket,id=human_monitor_eqFd,path=/tmp/monitor-humanmonitor1-20101124-114524-dMO8,server,nowait \
-mon chardev=human_monitor_eqFd,mode=readline \
-chardev socket,id=serial_i2zu,path=/tmp/serial-20101124-114524-dMO8,server,nowait \
-device isa-serial,chardev=serial_i2zu \
-drive file='./win7-32.qcow2',index=0,if=none,id=drive-ide0-0-0,media=disk,cache=writethrough,format=qcow2,aio=native \
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
-device rtl8139,mac=9a:3e:52:90:38:8e,netdev=idtaTtKT,id=ndev00idtaTtKT,bus=pci.0,addr=0x3 \
-netdev tap,id=idtaTtKT,ifname='t0-114524-dMO8',script='./qemu-ifup-switch',downscript='no' -m 2048 -smp 1 \
-cpu cpu64-rhel6,+sse2,+x2apic \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=none \
-M rhel6.0.0 -usbdevice tablet -no-kvm-pit-reinjection -enable-kvm


and,
# uname -r
2.6.32-71.9.1.el6.x86_64

# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.113.el6_0.4.x86_64


(qemu) info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
      id ""
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
      id ""
  Bus  0, device   1, function 1:
    IDE controller: PCI device 8086:7010
      BAR4: I/O at 0xc000 [0xc00f].
      id ""
  Bus  0, device   1, function 2:
    USB controller: PCI device 8086:7020
      IRQ 11.
      BAR4: I/O at 0xc020 [0xc03f].
      id ""
  Bus  0, device   1, function 3:
    Bridge: PCI device 8086:7113
      IRQ 9.
      id ""
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1013:00b8
      BAR0: 32 bit prefetchable memory at 0xf0000000 [0xf1ffffff].
      BAR1: 32 bit memory at 0xf2000000 [0xf2000fff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
      id ""
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 10ec:8139
      IRQ 10.
      BAR0: I/O at 0xc100 [0xc1ff].
      BAR1: 32 bit memory at 0xf2020000 [0xf20200ff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
      id "ndev00idtaTtKT"
Comment 9 Gleb Natapov 2011-01-03 06:29:03 EST
Cannot reproduce. At what point does it hang? What do "info cpus" and "info registers" in the qemu monitor show when it is stuck?
Comment 10 Cao, Chen 2011-01-04 04:40:58 EST
(In reply to comment #9)
> Cannot reproduce. At what point does it hang? What do "info cpus" and "info registers"
> in the qemu monitor show when it is stuck?

1.
Reproduced once out of 10 attempts on the Intel host, and at a very high rate
on the AMD host specified below.

2.
The Windows7 guest is stuck while preparing the login (unlock) screen.
The screenshot is attached.


on
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.113.el6_0.5.x86_64

# uname -r
2.6.32-71.13.1.el6.x86_64


3.
Info from the Intel host:
---
(qemu) info cpus
* CPU #0: pc=0x0000000082a1fcac thread_id=9440

(qemu) info registers
EAX=00000009 EBX=8078ad6c ECX=82732c09 EDX=000000a2
ESI=854bfd38 EDI=854b1c80 EBP=8078ace8 ESP=8078ace0
EIP=82a1fcac EFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
FS =0030 82732c00 00003748 00409300 DPL=0 DS   [-WA]
GS =0000 00000000 ffffffff 00000000
LDT=0000 00000000 ffffffff 00000000
TR =0028 801da000 000020ab 00008b00 DPL=0 TSS32-busy
GDT=     80b95000 000003ff
IDT=     80b95400 000007ff
CR0=80010031 CR2=00540000 CR3=00185000 CR4=000006f8
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
FCW=027f FSW=0000 [ST=0] FTW=00 MXCSR=00000000
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000


Info from the AMD host:
---
(qemu) info cpus
* CPU #0: pc=0x000000008260dcac thread_id=7402

(qemu) info registers
EAX=00000009 EBX=8078ad6c ECX=82767c09 EDX=000000a2
ESI=8543bd38 EDI=85428c80 EBP=8078ace8 ESP=8078ace0
EIP=8260dcac EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
FS =0030 82767c00 00003748 00409300 DPL=0 DS   [-WA]
GS =0000 00023de0 0000ffff 00000000
LDT=0000 00000000 0000ffff 00000000
TR =0028 801da000 000020ab 00008b00 DPL=0 TSS32-busy
GDT=     80b95000 000003ff
IDT=     80b95400 000007ff
CR0=8001003b CR2=02f90ffc CR3=7f1231a0 CR4=000006f8
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
FCW=027f FSW=0120 [ST=0] FTW=00 MXCSR=00000000
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=8000000000000000 3fff FPR5=c0fd200000000000 4002
FPR6=f000000000000000 4002 FPR7=8000000000000000 3fff
XMM00=00000000000000000000000000000000 XMM01=00430034003400310034003600420035
XMM02=002e0031005f00460044003100460043 XMM03=0031002e0030003000360037002e0031
XMM04=004e004f004e005f0035003800330036 XMM05=004300370043004600320037005f0045
XMM06=00350032003200310036003800460042 XMM07=004c0050004900440047005c00410043



4.
reproduced on the following hosts:

intel:
---
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
stepping        : 10
cpu MHz         : 2660.132
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm             constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor       ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow     vnmi flexpriority
bogomips        : 5319.73
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

amd:
---
processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : AMD Phenom(tm) 9600B Quad-Core Processor
stepping        : 3
cpu MHz         : 1150.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16   popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse          3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips        : 4587.43
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
Comment 11 Gleb Natapov 2011-01-04 04:59:13 EST
There is no screenshot attached. Can you attach it please? Also, when it is stuck, issue "x/50i $pc-50" + "info registers".
Comment 12 Cao, Chen 2011-01-04 20:49:39 EST
(In reply to comment #11)
> There is no screenshot attached. Can you attach it please? Also, when it is stuck,
> issue "x/50i $pc-50" + "info registers".

(qemu) x/50i $pc-50
0x0000000082a1fc7a:  or     $0xc1,%al
0x0000000082a1fc7c:  call   0x7ae8583
0x0000000082a1fc81:  add    %al,(%eax) 
0x0000000082a1fc83:  (bad)
0x0000000082a1fc84:  lcall  *-0x3e(%ebp) 
0x0000000082a1fc87:  or     %al,(%eax)
0x0000000082a1fc89:  int3
0x0000000082a1fc8a:  int3
0x0000000082a1fc8b:  int3
0x0000000082a1fc8c:  int3
0x0000000082a1fc8d:  int3
0x0000000082a1fc8e:  mov    %edi,%edi
0x0000000082a1fc90:  mov    0xfffe0300,%eax
0x0000000082a1fc95:  test   $0x1000,%eax
0x0000000082a1fc9a:  jne    0x82a1fc90
0x0000000082a1fc9c:  ret
0x0000000082a1fc9d:  int3
0x0000000082a1fc9e:  int3
0x0000000082a1fc9f:  int3
0x0000000082a1fca0:  int3
0x0000000082a1fca1:  int3
0x0000000082a1fca2:  movl   $0x0,0xfffe00b0
0x0000000082a1fcac:  ret
0x0000000082a1fcad:  int3
0x0000000082a1fcae:  int3
0x0000000082a1fcaf:  int3
0x0000000082a1fcb0:  int3
0x0000000082a1fcb1:  int3
0x0000000082a1fcb2:  mov    %edi,%edi
0x0000000082a1fcb4:  push   %ebp
0x0000000082a1fcb5:  mov    %esp,%ebp
0x0000000082a1fcb7:  cmpb   $0x0,0x82a36182
0x0000000082a1fcbe:  mov    0x8(%ebp),%eax
0x0000000082a1fcc1:  jne    0x82a1fcc6
0x0000000082a1fcc3:  shl    $0x18,%eax
0x0000000082a1fcc6:  mov    0xc(%ebp),%edx
0x0000000082a1fcc9:  push   %esi
0x0000000082a1fcca:  xor    %esi,%esi
0x0000000082a1fccc:  xor    %ecx,%ecx
0x0000000082a1fcce:  or     %esi,%eax
0x0000000082a1fcd0:  or     %edx,%ecx
0x0000000082a1fcd2:  push   %eax
0x0000000082a1fcd3:  push   %ecx
0x0000000082a1fcd4:  call   *0x82a361a8
0x0000000082a1fcda:  pop    %esi
0x0000000082a1fcdb:  pop    %ebp
0x0000000082a1fcdc:  ret    $0x8
0x0000000082a1fcdf:  int3
0x0000000082a1fce0:  int3
0x0000000082a1fce1:  int3


(qemu) info registers
EAX=00000009 EBX=8078ad6c ECX=82732c09 EDX=000000a2
ESI=854bfd38 EDI=854b1c80 EBP=8078ace8 ESP=8078ace4
EIP=82a1ec86 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0023 00000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
FS =0030 82732c00 00003748 00409300 DPL=0 DS   [-WA]
GS =0000 00000000 ffffffff 00000000
LDT=0000 00000000 ffffffff 00000000
TR =0028 801da000 000020ab 00008b00 DPL=0 TSS32-busy
GDT=     80b95000 000003ff
IDT=     80b95400 000007ff
CR0=80010031 CR2=00540000 CR3=00185000 CR4=000006f8
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
FCW=027f FSW=0000 [ST=0] FTW=00 MXCSR=00000000
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000
XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000
XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000
XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000
XMM07=00000000000000000000000000000000
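
The two absolute addresses in the disassembly above, 0xfffe0300 and 0xfffe00b0, look like local APIC registers. A small Python sketch to decode them, assuming the 32-bit Windows HAL maps the local APIC page at guest-virtual address 0xfffe0000 (that base is an assumption, it is not stated anywhere in this report); the offsets follow the standard local APIC register layout, and the function name is illustrative only.

```python
# Assumed guest-virtual address of the local APIC page (not confirmed in this bug).
APIC_BASE_VA = 0xFFFE0000

# A few offsets from the standard local APIC register layout.
APIC_REGS = {
    0x080: "TPR (task priority)",
    0x0B0: "EOI (end of interrupt)",
    0x300: "ICR low (interrupt command, bit 12 = delivery status)",
    0x310: "ICR high (interrupt command, destination)",
}


def decode(addr):
    """Return a register name if addr falls inside the assumed APIC page, else None."""
    offset = addr - APIC_BASE_VA
    if not 0 <= offset < 0x1000:
        return None
    return APIC_REGS.get(offset, "APIC offset 0x%03x" % offset)


for addr in (0xFFFE0300, 0xFFFE00B0):
    print("0x%08x -> %s" % (addr, decode(addr)))
```

Under that assumption, the loop at 0x82a1fc90 polls the ICR delivery-status bit, and the stub at 0x82a1fca2 writes 0 to the EOI register. The `$pc` used for the disassembly (0x82a1fc7a + 50 = 0x82a1fcac) is the `ret` immediately after that EOI write, which would fit a guest that keeps acknowledging an interrupt.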
Comment 13 Cao, Chen 2011-01-04 20:50:27 EST
Created attachment 471787 [details]
stuck on the intel host
Comment 14 Cao, Chen 2011-01-04 20:50:54 EST
Created attachment 471788 [details]
stuck on the amd host
Comment 15 Gleb Natapov 2011-01-05 04:23:24 EST
I noticed that in comment #0 you use vhost=on. vhost is not supported on rhel6.0. Is this reproducible without vhost? If yes, does qemu take 100% CPU when it is stuck? After it gets stuck, run ftrace like this:

# echo kvm >  /sys/kernel/debug/tracing/set_event
# cat /sys/kernel/debug/tracing/trace > /tmp/trace

Attach /tmp/trace here.
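
A trace of this kind is easier to read once the events are tallied. A minimal sketch in Python, assuming the usual ftrace text layout of `task-pid [cpu] timestamp: event: details`; the sample lines are made up for illustration and are not taken from the attached trace, and `count_events` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed ftrace line layout: "<task>-<pid> [<cpu>] <timestamp>: <event>: <details>"
LINE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+.*?"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):"
)


def count_events(lines):
    """Count trace events by name, skipping '#' header lines and unparsable lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            counts[match.group("event")] += 1
    return counts


# Hypothetical sample, only to exercise the parser.
SAMPLE = """\
# tracer: nop
        qemu-kvm-9440  [001]  1000.000001: kvm_entry: vcpu 0
        qemu-kvm-9440  [001]  1000.000002: kvm_exit: sample details
        qemu-kvm-9440  [001]  1000.000003: kvm_inj_virq: sample details
        qemu-kvm-9440  [001]  1000.000004: kvm_entry: vcpu 0
"""

for name, count in count_events(SAMPLE.splitlines()).most_common():
    print(name, count)
```

On a real capture, pass the lines of /tmp/trace instead of `SAMPLE`; a single event dominating the tally would be the first thing to look at.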
Comment 16 Cao, Chen 2011-01-05 20:59:43 EST
Created attachment 471976 [details]
ftrace for kvm when Windows7 guest stuck while resuming from S4

(In reply to comment #15)
> I noticed that in comment #0 you use vhost=on. vhost is not supported on
> rhel6.0. Is this reproducible without vhost?

Yes, this is reproducible without vhost, using the command line in comment #8.

> If yes does qemu take 100% cpu
> when it stuck?

top -p `pidof qemu-kvm` with "show threads on"
---
top - 09:37:36 up 1 day, 18:05, 20 users,  load average: 0.82, 0.93, 0.87
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.0%us,  1.7%sy,  0.0%ni, 95.6%id,  0.7%wa,  0.0%hi, 0.0%si,  0.0%st
Cpu1  : 56.2%us, 43.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 0.0%si,  0.0%st
Cpu2  :  2.6%us,  1.5%sy,  0.0%ni, 95.4%id,  0.4%wa,  0.0%hi, 0.0%si,  0.0%st
Cpu3  :  1.3%us,  0.4%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi, 0.2%si,  0.0%st
Mem:   7994548k total,  5606560k used,  2387988k free,   239796k buffers
Swap: 10239992k total,    67496k used, 10172496k free,  2715104k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  9440 root      20   0 2314m 2.0g 2720 R 100.0 26.1  2484:08 qemu-kvm
  9422 root      20   0 2314m 2.0g 2720 S  3.1 26.1  77:33.64 qemu-kvm
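
The busy thread in the top output can be matched against the vCPU thread ID that `info cpus` reported in comment 10 (thread_id=9440). A rough Python sketch, assuming top's default column order as shown above; `busiest_thread` is an illustrative name, not an existing tool.

```python
# The two thread rows from the top output above.
TOP_ROWS = """\
  9440 root      20   0 2314m 2.0g 2720 R 100.0 26.1  2484:08 qemu-kvm
  9422 root      20   0 2314m 2.0g 2720 S  3.1 26.1  77:33.64 qemu-kvm
"""

# From "info cpus" on the Intel host in comment 10.
VCPU_THREAD_ID = 9440


def busiest_thread(rows):
    """Return (thread_id, cpu_percent) for the row with the highest %CPU."""
    best = None
    for row in rows.splitlines():
        fields = row.split()
        # Expect: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        tid, cpu = int(fields[0]), float(fields[8])
        if best is None or cpu > best[1]:
            best = (tid, cpu)
    return best


tid, cpu = busiest_thread(TOP_ROWS)
print(tid, cpu, tid == VCPU_THREAD_ID)
```

The thread at 100% is the one `info cpus` listed as CPU #0, which suggests the guest vCPU itself is spinning rather than a qemu I/O thread.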


> Run ftrace after it stuck like this:
> 
> # echo kvm >  /sys/kernel/debug/tracing/set_event
> # cat /sys/kernel/debug/tracing/trace > /tmp/trace
> 
> Attach /tmp/trace here.
Comment 17 Gleb Natapov 2011-01-06 02:29:02 EST
Try to reproduce without network ("-net none" option).
Comment 18 Cao, Chen 2011-01-07 04:09:52 EST
(In reply to comment #17)
> Try to reproduce without network ("-net none" option).

Tried 50+ times with -net none and could not reproduce.
I also cannot reproduce the problem when no net options are
given on the command line at all (user mode).

/usr/libexec/qemu-kvm -name vm1 \
-chardev socket,id=human_monitor_eqFd,path=/tmp/monitor-humanmonitor1-20101924-114524-dMO8,server,nowait \
-mon chardev=human_monitor_eqFd,mode=readline \
-chardev socket,id=serial_i2zu,path=/tmp/serial-20101924-114524-dMO8,server,nowait \
-device isa-serial,chardev=serial_i2zu \
-drive file=./win7-32-20110104.qcow2,index=0,if=none,id=drive-ide0-0-0,media=disk,cache=writethrough,format=qcow2,aio=native \
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
-m 2048 -smp 1 -cpu cpu64-rhel6,+sse2,+x2apic -vnc :12 \
-rtc base=utc,clock=host,driftfix=none -M rhel6.0.0 \
-usbdevice tablet -no-kvm-pit-reinjection -enable-kvm -net none



I also tried qemu-kvm-0.12.1.2-2.113.el6_0.6 using the command line from
comment 8, and can still easily reproduce the problem.
Comment 19 Gleb Natapov 2011-02-01 03:47:26 EST
This looks like a Windows 7 bug to me. I have already seen it with e1000.
During resume, Windows 7 enables receive on the NIC too early (before it is
ready to receive interrupts). If there is already a packet in the NIC's queue,
the NIC generates an interrupt immediately and Windows hangs. It is possible
that our rtl8139 emulation starts sending interrupts too early. I haven't
checked against the rtl8139 spec, but when I investigated the same problem
with e1000 I checked what Windows 7 does against the e1000 spec, and it looked
like Windows does the wrong thing, so I assume the rtl8139 problem is the
same. It would be interesting to check with virtio, where we control the
driver too.

On real HW such a problem may not be visible, since the NIC does not start to
receive packets immediately after the receiver is enabled. It takes a
couple of msecs to do link discovery and autonegotiation.
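
The ordering problem described in this comment can be illustrated with a toy model in Python. This is only a sketch of the hypothesis, not the qemu device code: the `ToyNic` class, its fields, and the `link_delay_ticks` and `driver_setup_ticks` parameters are all invented for illustration.

```python
class ToyNic:
    """Toy NIC that raises an IRQ once RX is enabled, the link is up, and a packet waits."""

    def __init__(self, link_delay_ticks=0):
        # Hypothetical stand-in for link discovery / autonegotiation time.
        self.link_delay_ticks = link_delay_ticks
        self.rx_enabled = False
        self.ticks_since_enable = 0
        self.pending_packets = 0
        self.irq_raised = False

    def queue_packet(self):
        self.pending_packets += 1
        self._maybe_raise_irq()

    def enable_rx(self):
        self.rx_enabled = True
        self.ticks_since_enable = 0
        self._maybe_raise_irq()

    def tick(self):
        if self.rx_enabled:
            self.ticks_since_enable += 1
        self._maybe_raise_irq()

    def _maybe_raise_irq(self):
        link_up = self.rx_enabled and self.ticks_since_enable >= self.link_delay_ticks
        if link_up and self.pending_packets > 0:
            self.irq_raised = True


def irq_before_driver_ready(nic, driver_setup_ticks):
    """True if the IRQ fires before the guest driver has finished its setup."""
    nic.queue_packet()   # a packet is already waiting on the host side
    nic.enable_rx()      # the guest enables receive early during resume
    for _ in range(driver_setup_ticks):
        if nic.irq_raised:
            return True
        nic.tick()
    return False


print(irq_before_driver_ready(ToyNic(link_delay_ticks=0), driver_setup_ticks=3))
print(irq_before_driver_ready(ToyNic(link_delay_ticks=5), driver_setup_ticks=3))
```

With no link delay the interrupt arrives while the driver is still setting up; with a delay longer than the setup time it does not, which is the difference between the emulated NIC and real hardware that the comment points at.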
Comment 20 Gleb Natapov 2011-02-01 03:48:13 EST
Can you try with virtio-net?
Comment 21 Cao, Chen 2011-02-01 09:14:18 EST
(In reply to comment #20)
> Can you try with virtio-net?

I have tried more than 20 times and cannot reproduce it
using virtio-net.


cmd:
/usr/libexec/qemu-kvm -name 'vm1' \
-chardev socket,id=human_monitor_TJFq,path=/tmp/monitor-humanmonitor1-20110130-104015-Ghji,server,nowait \
-mon chardev=human_monitor_TJFq,mode=readline \
-chardev socket,id=serial_2Lar,path=/tmp/serial-20110130-104015-Ghji,server,nowait \
-device isa-serial,chardev=serial_2Lar \
-drive file='./win7-32-virtio.qcow2',index=0,if=none,id=drive-ide0-0-0,media=disk,cache=writethrough,format=qcow2,aio=native \
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
-device virtio-net-pci,netdev=idP499NY,mac=9a:b4:c8:bf:da:c3,netdev=idP499NY,id=ndev00idP499NY,bus=pci.0,addr=0x3 \
-netdev tap,id=idP499NY,ifname='t0-104015-Ghji',script='qemu-ifup-switch',downscript='no' \
-m 512 -smp 1,cores=1,threads=1,sockets=1 \
-cpu cpu64-rhel6,+sse2,+x2apic -vnc :0 \
-rtc base=localtime,clock=host,driftfix=none -M rhel6.0.0 \
-boot order=cdn,once=c,menu=off -usbdevice tablet -enable-kvm
Comment 27 Mike Cao 2012-02-08 00:30:57 EST
Hi all,

I sometimes hit this issue during WHQL testing, especially with win2003 guests.
After hitting it, QE needs to reboot the guest manually and choose "resume from disk" again; it delays our test schedules if QE does not notice that the guest has hung.

What's more, Bug 769163's root cause is this one. I think this bug should be fixed ASAP.
Comment 28 Mike Cao 2012-02-08 00:32:04 EST
(In reply to comment #27)

> What's more ,Bug 769163 's root cause is this one .I think this bug should be
> fixed ASAP .
I mean Bug 769163's root cause *might* be this bug.
Comment 31 Ronen Hod 2012-02-27 02:01:26 EST
Yan,
What do you think?
Ronen.
Comment 32 Yan Vugenfirer 2012-02-28 11:17:53 EST
I think we should review the e1000 spec and the device implementation, looking at the actions taken during device reset.
Checking just the RCTL register in e1000_receive looks naive to me.

Also, do I understand correctly that this issue cannot be reproduced with spice?

Another question: are you using some special BIOS?


Yet another option: try to connect WinDbg to the guest under test and break when it is stuck. If that succeeds, it might give us some additional info on where Windows is stuck.
Comment 33 jason wang 2012-03-02 05:29:33 EST
For the e1000 reset, I've tried Michael's fixes for resetting; they do not work (they may have some defect, but I haven't checked). One interesting thing is that this bz is only reproduced during resume (not boot). From the debug log, the Windows guest does things in a different order when booting and when resuming:

For booting, it enables the interrupt before letting the card receive packets. For resuming, it enables the interrupt after letting the card receive packets. If some packets arrive before the interrupt is enabled, then when the guest enables the interrupt an irq is injected into the guest and the guest hangs. Since the Windows driver behaves differently in the two cases, it may well be that something is wrong with the order of irq enabling and irq handler registration.

I see tons of unhandled e1000/8139 irqs injected during the EOI broadcast. It seems the guest has no time to do anything except try to handle those irqs, which have no handlers.

It seems our IOAPIC EOI broadcast emulation re-delivers the irq immediately if it finds the irq is still active (which is common when no irq handler is registered in the guest). So each time the guest tries to leave the irq handler and re-enable irqs, the irq-window handler injects that irq into the guest again. As this repeats again and again, the guest stays busy and never has time to move forward.

In conclusion, if a level irq is unhandled, it is injected into the guest endlessly; the guest cannot get to the following steps, such as registering its handler, and hangs forever. I'm not sure this is exactly how real hw behaves. This can also be reproduced with a linux guest when an unhandled level irq is raised (see bz787959, which is a driver bug). During my examination, if we let the guest move forward a little before reinjecting the irq, everything was fine.

So there's a high possibility that the Windows driver has a bug (enabling the irq before its handler is registered).
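
The starvation described in this comment can be mimicked with a toy loop in Python. It is a sketch under stated assumptions, not KVM's IOAPIC code: the `resume` function, the `steps_needed` count, and the `grace` parameter are all hypothetical.

```python
def resume(steps_needed, grace, max_cycles=10_000):
    """Toy model of an unhandled level-triggered IRQ during guest resume.

    The guest must execute `steps_needed` steps before its IRQ handler is
    registered. Until then the IRQ line stays asserted, so after every EOI the
    (toy) interrupt controller injects the IRQ again. `grace` is how many
    steps the guest gets to run between an EOI and the next injection;
    grace=0 models immediate re-delivery.

    Returns the number of inject/EOI cycles until the handler is registered,
    or None if the guest never gets there within `max_cycles` (a hang).
    """
    progress = 0
    for cycle in range(1, max_cycles + 1):
        # IRQ injected; no handler exists, so the guest can only EOI it.
        # After the EOI the guest runs `grace` steps before the next injection.
        progress += grace
        if progress >= steps_needed:
            return cycle
    return None


print(resume(steps_needed=100, grace=0))   # immediate re-delivery
print(resume(steps_needed=100, grace=1))   # guest moves a little each time
```

With `grace=0` the guest makes no progress at all, matching the hang; any non-zero grace lets it eventually register the handler, matching the observation that everything is fine if the guest can move a little before the irq is reinjected.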
Comment 42 jason wang 2012-03-11 22:42:41 EDT
*** Bug 716804 has been marked as a duplicate of this bug. ***
Comment 44 Ronen Hod 2012-03-12 02:48:43 EDT
Jason has a solution and is sending it upstream.
We prefer to wait for the next Z-stream / 6.4.
Comment 50 langfang 2012-10-12 07:27:01 EDT
Reproduced this bug on the following versions:
host:
# uname -r
2.6.32-279.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.295.el6.x86_64

guest:
win7-32

steps:
1. Boot the guest with an rtl8139 NIC:
/usr/libexec/qemu-kvm -name 'vm1' -chardev socket,id=human_monitor_eqFd,pp/monitor-humanmonitor1-20101124-114524-dMO8,server,nowait -mon chardev=human_monitor_eqFd,mode=readline -chardev socket,id=serial_i2zu,path=/tmp/serial-20101124-114524-dMO8,server,nowait -device isa-serial,chardev=serial_i2zu -drive file=/home/win7-32.qcow2,index=0,if=none,id=drive-ide0-0-0,media=disk,cache=writethrough,format=qcow2,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -device rtl8139,mac=9a:3e:52:90:38:8e,netdev=idtaTtKT,id=ndev00idtaTtKT,bus=pci.0,addr=0x3 -netdev tap,id=idtaTtKT,ifname='t0-114524-dMO8',script=/etc/qemu-ifup,downscript='no' -m 2048 -smp 1 -cpu Penryn -rtc base=utc,clock=host,driftfix=none -M rhel6.3.0 -usb -device usb-tablet -no-kvm-pit-reinjection -enable-kvm -spice port=5931,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -monitor stdio -bios /usr/share/seabios/bios-pm.bin
2. Do S3/S4:

ctrl+alt+del, then choose Sleep/Hibernate.

Result:
After S4, resuming the guest leaves it unresponsive.

Tested this bug on the following versions:

host:
# uname -r
2.6.32-279.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.321.el6.x86_64

guest:
win7-32

steps:
1. Boot the guest with an e1000 NIC.
2. Do S3/S4.


Results: tried more than 5 times; after S3/S4 the guest resumes successfully and works well.

Additional info:
1) If the guest is booted with an rtl8139 NIC on the new qemu version, the issue is still present: the guest becomes unresponsive while resuming.
Comment 51 jason wang 2012-10-15 01:24:06 EDT
(In reply to comment #50)
> reproduce this bug as follow version:
> host:
> # uname -r
> 2.6.32-279.el6.x86_64
> # rpm -q qemu-kvm
> qemu-kvm-0.12.1.2-2.295.el6.x86_64
> 
> guest:
> win7-32
> 
> steps:
> 1.boot guest with rtl8139 NIC
> /usr/libexec/qemu-kvm -name 'vm1' -chardev
> socket,id=human_monitor_eqFd,pp/monitor-humanmonitor1-20101124-114524-dMO8,
> server,nowait -mon chardev=human_monitor_eqFd,mode=readline -chardev
> socket,id=serial_i2zu,path=/tmp/serial-20101124-114524-dMO8,server,nowait
> -device isa-serial,chardev=serial_i2zu -drive
> file=/home/win7-32.qcow2,index=0,if=none,id=drive-ide0-0-0,media=disk,
> cache=writethrough,format=qcow2,aio=native -device
> ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -device
> rtl8139,mac=9a:3e:52:90:38:8e,netdev=idtaTtKT,id=ndev00idtaTtKT,bus=pci.0,
> addr=0x3 -netdev
> tap,id=idtaTtKT,ifname='t0-114524-dMO8',script=/etc/qemu-ifup,
> downscript='no' -m 2048 -smp 1 -cpu Penryn -rtc
> base=utc,clock=host,driftfix=none -M rhel6.3.0 -usb -device usb-tablet
> -no-kvm-pit-reinjection -enable-kvm -spice port=5931,disable-ticketing -vga
> qxl -global qxl-vga.vram_size=67108864 -monitor stdio -bios
> /usr/share/seabios/bios-pm.bin
> 2.do S3/S4
> 
> ctrl+alt+del---->choose sleep/Hibernate
> 
> result:
> after do S4,then resume  guest --->guest becomes unresponsive while resuming.
> 
> test this bug as follow version:
> 
> host:
> # uname -r
> 2.6.32-279.el6.x86_64
> # rpm -q qemu-kvm
> qemu-kvm-0.12.1.2-2.321.el6.x86_64
> 
> guest:
> win7-32
> 
> steps:
> 1.boot guest with e1000 NIC
> 2.do S3/S4
> 
> 
> results:tried more than 5 times after do S3/S4 -->guest resume
> successfully,guest work well.
> 

Could you please try more iterations, e.g. 1000 rounds of S3/S4 through autotest?
> addinfo:
> 1)if boot guest with rtl8139 NIC on the new qemu version,this issue still
> have.Guest becomes unresponsive while resuming

FYI, the rtl8139 issue is tracked in another bug, https://bugzilla.redhat.com/show_bug.cgi?id=847241, which is closed as WONTFIX since we will not put any effort into 8139 issues.
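
How many clean iterations are enough depends on the failure rate. A quick Python calculation, assuming independent attempts and taking the per-attempt failure rates reported earlier in this bug (about 1 in 10 on the Intel host in comment 10, 1 in 6 in comment 8) as rough estimates; the rate with e1000 may differ, and the function name is illustrative.

```python
def chance_of_false_pass(failure_rate, attempts):
    """Probability that an unfixed bug goes unnoticed in `attempts` independent runs."""
    return (1.0 - failure_rate) ** attempts


for attempts in (5, 50, 200):
    for rate in (1 / 10, 1 / 6):
        print(attempts, round(rate, 3), chance_of_false_pass(rate, attempts))
```

At a 1-in-10 failure rate, five clean runs happen about 59% of the time even without a fix, so five passes say very little; a few hundred iterations make a chance pass negligible.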
Comment 53 langfang 2012-10-25 01:08:39 EDT
Reproduced and verified this bug with about 200 test iterations. The fixed qemu-kvm version did not hit the problem, so this bug has been fixed.
Comment 55 errata-xmlrpc 2013-02-21 02:29:58 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0527.html
