Bug 1453090

Summary: [virtio-win]DPC_WATCHDOG_VIOLATION error when do balloon stress on Win8.1-x86_64 guest
Product: Red Hat Enterprise Linux 7 Reporter: Suqin Huang <shuang>
Component: virtio-winAssignee: Amnon Ilan <ailan>
virtio-win sub component: virtio-win-prewhql QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: medium CC: ailan, ddepaula, lijin, lmiksik, mtessun, phou, vrozenfe, xiagao
Version: 7.4   
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 06:28:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1473046    
Attachments:
Description Flags
error info none

Description Suqin Huang 2017-05-22 06:26:58 UTC
Created attachment 1280908 [details]
error info

Description of problem:


Version-Release number of selected component (if applicable):
virtio-win-prewhql-0.1-137
qemu-kvm-rhev-2.9.0-4.el7.x86_64

How reproducible:
some time

Steps to Reproduce:
1. run video in background

/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults  \
    -vga std  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_VSspUc/monitor-qmpmonitor1-20170522-013229-Q6Zfqcvc,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_VSspUc/monitor-catch_monitor-20170522-013229-Q6Zfqcvc,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idf4QrVI  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_VSspUc/serial-serial0-20170522-013229-Q6Zfqcvc,server,nowait \
    -device isa-serial,chardev=serial_id_serial0  \
    -chardev socket,id=seabioslog_id_20170522-013229-Q6Zfqcvc,path=/var/tmp/avocado_VSspUc/seabios-20170522-013229-Q6Zfqcvc,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20170522-013229-Q6Zfqcvc,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/win8-64.1-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x3 \
    -device virtio-net-pci,mac=9a:63:64:65:66:67,id=idcudggr,vectors=4,netdev=idRUhH89,bus=pci.0,addr=0x4  \
    -netdev tap,id=idRUhH89,vhost=on,vhostfd=20,fd=19 \
    -m 8192  \
    -smp 8,cores=4,threads=1,sockets=2  \
    -cpu 'Opteron_G3',+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_vapic,hv_time \
    -drive id=drive_cd1,if=none,snapshot=off,aio=native,cache=none,media=cdrom,file=/home/kvm_autotest_root/iso/windows/winutils.iso \
    -device ide-cd,id=cd1,drive=drive_cd1,bootindex=1,bus=ide.0,unit=0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=localtime,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5



2. repeat balloon test with random mem 500 times
{'execute': 'balloon', 'arguments': {'value': 7796162560}, 'id': 'fR5eEZSM'}

3.

Actual results:
DPC_WATCHDOG_VIOLATION error

Expected results:


Additional info:

Guest: Win8.1-x86_64
host: 
processor	: 15
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 9
model name	: AMD Opteron(tm) Processor 6128
stepping	: 1
microcode	: 0x10000d9
cpu MHz		: 2000.089
cache size	: 512 KB
physical id	: 1
siblings	: 8
core id		: 3
cpu cores	: 8
apicid		: 39
initial apicid	: 23
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc art rep_good nopl nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt lbrv svm_lock nrip_save pausefilter

Comment 5 lijin 2017-05-26 06:56:06 UTC
following is the windbg info,looks not balloon issue

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
	DISPATCH_LEVEL or above. The offending component can usually be
	identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------


DPC_TIMEOUT_TYPE:  DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  0x133

PROCESS_NAME:  System

CURRENT_IRQL:  d

ANALYSIS_VERSION: 6.3.9600.16384 (debuggers(dbg).130821-1623) amd64fre

LAST_CONTROL_TRANSFER:  from fffff8009fc1bcf0 to fffff8009fbd59a0

STACK_TEXT:  
fffff800`a146fc88 fffff800`9fc1bcf0 : 00000000`00000133 00000000`00000001 00000000`00001e00 00000000`00000000 : nt!KeBugCheckEx
fffff800`a146fc90 fffff800`9fb400e7 : 00000000`00000000 00000000`00000000 00000000`00000001 fffff800`9fbd77a8 : nt! ?? ::FNODOBFM::`string'+0x35e40
fffff800`a146fd20 fffff800`9fa1d67f : fffff800`6a28c020 00000000`00000000 fffff800`9fa699b0 00000000`00000000 : nt!KeClockInterruptNotify+0x787
fffff800`a146ff40 fffff800`9fb674f3 : fffff800`9fa69900 00000000`00000000 ffffd001`751823c0 fffff800`9fbd6a87 : hal!HalpTimerClockInterrupt+0x4f
fffff800`a146ff70 fffff800`9fbd6e2a : fffff800`9fa69900 00000000`c00000bb 00000000`00000002 ffffe000`2c530440 : nt!KiCallInterruptServiceRoutine+0xa3
fffff800`a146ffb0 fffff800`9fbd779b : 00000000`00000000 00000000`000005b4 ffffe000`000005b4 00000000`00000000 : nt!KiInterruptSubDispatchNoLockNoEtw+0xea
fffff800`a1467230 fffff800`6a120d88 : ffffe000`2c5c8ec0 fffff800`a14674c0 ffffe000`2c5c8ec0 ffffe000`2b3e7980 : nt!KiInterruptDispatchNoLockNoEtw+0xfb
fffff800`a14673c0 fffff800`6a107571 : ffffe000`00000000 ffffe000`2b798010 fffff800`a1467559 fffff800`a14675f0 : tcpip!IpNlpSetPathInfo+0x616
fffff800`a14674a0 fffff800`6a106d58 : ffffe000`2c3550a0 00000000`7b0e6809 fffff800`a1467930 ffffe000`2c3550a0 : tcpip!TcpBwUpdateEre+0x59c
fffff800`a14675c0 fffff800`6a10afa7 : fffff800`a1467740 fffff800`a1467930 00000004`ce7b1f2a 00000000`000005b4 : tcpip!TcpBwManageTrackedData+0x128
fffff800`a1467640 fffff800`6a10c787 : ffffe000`2bc88d08 ffffe000`2bb1d1a0 fffff800`00000000 00000000`00000001 : tcpip!TcpTcbFastDatagram+0x757
fffff800`a1467890 fffff800`6a10d370 : 00000000`00000000 fffff800`6a1453bf ffffe000`29ca6650 ffffe000`2b69dbb0 : tcpip!TcpTcbReceive+0x247
fffff800`a14679e0 fffff800`6a10c123 : ffffd001`77162022 ffffe000`2b73a140 00000000`00000000 ffffe000`2b673f40 : tcpip!TcpMatchReceive+0x1f0
fffff800`a1467b70 fffff800`6a141f9f : ffffe000`2b773470 fffff800`a1467e99 fffff800`a14620c0 00000000`00000000 : tcpip!TcpPreValidatedReceive+0x363
fffff800`a1467c70 fffff800`6a13e143 : fffff800`00000000 fffff800`00000000 ffffe000`ffffffff fffff800`00000000 : tcpip!IppDeliverListToProtocol+0x4f
fffff800`a1467d30 fffff800`6a13c525 : ffffe000`2bc137b0 00000000`00000000 ffffe000`2b5bd700 fffff800`a1467e38 : tcpip!IppProcessDeliverList+0x63
fffff800`a1467dd0 fffff800`6a13a900 : ffffe000`2bb1d1a0 ffffe000`2b803c80 fffff800`6a28c180 ffffe000`2b73a000 : tcpip!IppReceiveHeaderBatch+0x235
fffff800`a1467f00 fffff800`6a138d22 : ffffe000`2b9c2510 00000000`00000000 fffff800`a1468301 00000000`00000000 : tcpip!IppFlcReceivePacketsCore+0x680
fffff800`a1468280 fffff800`6a139745 : ffffe000`2b9e0002 00000000`00000000 fffff800`6a139790 ffffe000`00000101 : tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x318
fffff800`a1468360 fffff800`9fac3873 : 00000000`0000000a 00000000`00000000 ffffe000`2b78c910 fffff800`a1463000 : tcpip!FlReceiveNetBufferListChainCalloutRoutine+0x155
fffff800`a1468490 fffff800`6a139936 : fffff800`6a1395f0 fffff800`a14685b0 00000000`00000010 00000000`00000001 : nt!KeExpandKernelStackAndCalloutInternal+0xf3
fffff800`a1468580 fffff800`69e45a53 : 00000000`00000000 fffff800`a1468661 00000000`00000006 ffffe000`2bbb3d38 : tcpip!FlReceiveNetBufferListChain+0xb6
fffff800`a1468600 fffff800`69e45e7f : 00000000`00000001 00000000`00000000 00000000`00000000 00000000`00000006 : ndis!ndisMIndicateNetBufferListsToOpen+0x123
fffff800`a14686c0 fffff800`69e466b2 : ffffe000`2bb1d1a0 ffffe563`aa3a9e01 fffff800`69e524e0 fffff800`6ab7db67 : ndis!ndisMTopReceiveNetBufferLists+0x22f
fffff800`a1468750 fffff800`6ab75f81 : 00000000`0000000a ffffe000`2bbaf910 ffffe000`2bba9548 fffff800`6a13368c : ndis!NdisMIndicateReceiveNetBufferLists+0x732
fffff800`a1468940 fffff800`6ab73358 : fffff800`0000000a ffffe000`29ca6650 fffff800`00000000 ffffe000`2b803c80 : netkvm+0x4f81
fffff800`a14689b0 fffff800`6ab7f988 : ffffe000`2bba9000 fffff800`a1468ad0 00000000`00000000 00000000`00000007 : netkvm+0x2358
fffff800`a1468a10 fffff800`69e541b8 : ffffe000`2b7a4980 fffff800`a1468a78 fffff800`00000000 00001f80`00e5026a : netkvm+0xe988
fffff800`a1468a70 fffff800`69e544b3 : ffffe000`2bc12528 00000000`0000000a 00000000`00000001 00000000`00000000 : ndis!ndisMiniportDpc+0x110
fffff800`a1468b10 fffff800`69e6680e : 00000004`00000005 ffffe000`2b6073d0 fffff800`a1468ea0 ffffe000`2bc12e68 : ndis!ndisQueuedMiniportDpcWorkItem+0xfb
fffff800`a1468bb0 fffff800`9fb4fcd0 : fffff800`9fd8af00 fffff800`a1468ea0 00000000`00000000 00000000`00000000 : ndis!ndisPeriodicReceivesTimer+0x12a
fffff800`a1468be0 fffff800`9fb4ef87 : ffffe000`2c7d9080 fffff800`a1468e30 fffff800`9fd88180 fffff800`9fd88180 : nt!KiExecuteAllDpcs+0x1b0
fffff800`a1468d30 fffff800`9fbd8ad5 : 00000000`00000000 fffff800`9fd88180 fffff800`9fa69900 00000000`00000000 : nt!KiRetireDpcList+0xd7
fffff800`a1468fb0 fffff800`9fbd88d9 : fffff800`9fd88180 fffff800`9fbd88ef ffffe000`28d1d300 ffffd001`74ecddd0 : nt!KxRetireDpcList+0x5
ffffd001`74ecd8a0 fffff800`9fbdab45 : fffff800`9fa69900 fffff800`9fbd7813 ffffd001`74ecd8f0 00000000`00000018 : nt!KiDispatchInterruptContinue
ffffd001`74ecd8d0 fffff800`9fbd7813 : ffffd001`74ecd8f0 00000000`00000018 ffffe000`2c31b880 fffff800`9fd88180 : nt!KiDpcInterruptBypass+0x25
ffffd001`74ecd8e0 fffff800`9fab6da9 : fffff800`9fd88180 ffffe000`28d1d300 ffffd001`fffffffe fffff800`fffffffe : nt!KiInterruptDispatchNoLockNoEtw+0x173
ffffd001`74ecda70 fffff800`9fab67f9 : 00000000`00000001 fffff800`9fd8af28 00000000`00000000 00000000`00000001 : nt!KiSwapThread+0x179
ffffd001`74ecdb10 fffff800`9faf667f : 00000000`00000000 00000000`00000005 00000000`00000000 ffffe000`28d1d300 : nt!KiCommitThreadWait+0x129
ffffd001`74ecdb90 fffff800`9fbb45b0 : fffff800`9fd88180 00000000`00000000 ffffe000`28d1d300 00bb0000`00000026 : nt!KeWaitForGate+0x10b
ffffd001`74ecdbe0 fffff800`9fb86c70 : 00530059`0053000a ffffe000`28d1d300 00000000`00000080 00010000`0000b8c8 : nt!KiExecuteDpc+0x74
ffffd001`74ecdd40 fffff800`9fbdbfc6 : ffffd001`78bc7180 ffffe000`28d1d300 ffffd001`78bd33c0 00310047`004f004c : nt!PspSystemThreadStartup+0x58
ffffd001`74ecdda0 00000000`00000000 : ffffd001`74ece000 ffffd001`74ec8000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16


STACK_COMMAND:  kb

FOLLOWUP_IP: 
tcpip!IpNlpSetPathInfo+616
fffff800`6a120d88 488b0d91b21600  mov     rcx,qword ptr [tcpip!WPP_GLOBAL_Control (fffff800`6a28c020)]

SYMBOL_STACK_INDEX:  7

SYMBOL_NAME:  tcpip!IpNlpSetPathInfo+616

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: tcpip

IMAGE_NAME:  tcpip.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  54505542

BUCKET_ID_FUNC_OFFSET:  616

FAILURE_BUCKET_ID:  0x133_VRF_ISR_tcpip!IpNlpSetPathInfo

BUCKET_ID:  0x133_VRF_ISR_tcpip!IpNlpSetPathInfo

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:0x133_vrf_isr_tcpip!ipnlpsetpathinfo

FAILURE_ID_HASH:  {d0da75d6-6f5f-c4bc-a004-0c20d177c7c6}

Followup: MachineOwner
---------

Comment 6 Ladi Prosek 2017-05-26 08:13:26 UTC
It may still be caused by balloon, even if it's not on the stack when the bugcheck fired. Could you please try this a few more times to see if you can get dumps with balloon on the stack?

One thing we may improve in the driver is make sure that we don't hold locks when notifying virtqueues. If for some reason the host takes longer to process the virtqueue, the driver is stuck in the kick I/O with a spinlock held -- and that means IRLQ = DISPATCH_LEVEL.

virtqueue_kick can be split into virtqueue_kick_prepare (must run under lock) and virtqueue_notify (may run at PASSIVE_LEVEL). Commit db6a904 is an example of this.

Comment 7 xiagao 2017-05-31 06:11:30 UTC
(In reply to Ladi Prosek from comment #6)
> It may still be caused by balloon, even if it's not on the stack when the
> bugcheck fired. Could you please try this a few more times to see if you can
> get dumps with balloon on the stack?

Could you give some suggestion how to reproduce it for getting dumps with balloon on the stack?

Thanks,
xiagao
> 
> One thing we may improve in the driver is make sure that we don't hold locks
> when notifying virtqueues. If for some reason the host takes longer to
> process the virtqueue, the driver is stuck in the kick I/O with a spinlock
> held -- and that means IRLQ = DISPATCH_LEVEL.
> 
> virtqueue_kick can be split into virtqueue_kick_prepare (must run under
> lock) and virtqueue_notify (may run at PASSIVE_LEVEL). Commit db6a904 is an
> example of this.

Comment 8 Ladi Prosek 2017-05-31 06:27:15 UTC
(In reply to xiagao from comment #7)
> (In reply to Ladi Prosek from comment #6)
> > It may still be caused by balloon, even if it's not on the stack when the
> > bugcheck fired. Could you please try this a few more times to see if you can
> > get dumps with balloon on the stack?
> 
> Could you give some suggestion how to reproduce it for getting dumps with
> balloon on the stack?

We just need to keep trying, using the steps in comment 0, and hope balloon will be on the stack when the BSOD occurs. I don't think there's anything we can do to help it.

Can you please get a few more dumps and share them with me? Thanks!

Comment 9 xiagao 2017-06-26 05:44:56 UTC
(In reply to Ladi Prosek from comment #8)
> (In reply to xiagao from comment #7)
> > (In reply to Ladi Prosek from comment #6)
> > > It may still be caused by balloon, even if it's not on the stack when the
> > > bugcheck fired. Could you please try this a few more times to see if you can
> > > get dumps with balloon on the stack?
> > 
> > Could you give some suggestion how to reproduce it for getting dumps with
> > balloon on the stack?
> 
> We just need to keep trying, using the steps in comment 0, and hope balloon
> will be on the stack when the BSOD occurs. I don't think there's anything we
> can do to help it.
> 
> Can you please get a few more dumps and share them with me? Thanks!

I have tried 150 times and have not hit this issue.

Comment 10 Ladi Prosek 2017-06-26 05:53:37 UTC
(In reply to xiagao from comment #9)
> I have tried 150 times and have not hit this issue.

Bummer. I'll do one more pass over the source code to see if there's anything else that we can improve. If nothing else, I'll at least split virtqueue_kick per comment 6. Thanks!

Comment 12 lijin 2017-08-29 06:42:09 UTC
package info:
virtio-win-1.9.3-1.el7.iso
qemu-kvm-rhev-2.9.0-16.el7_4.5.x86_64
kernel-3.10.0-693.el7.x86_64
seabios-bin-1.10.2-3.el7_4.1.noarch

Comment 20 lijin 2017-09-25 06:14:06 UTC
I file bz1495070 to track the iometer issue

Comment 21 Ladi Prosek 2017-11-07 13:30:50 UTC
The virtqueue locking tweak outlined in comment 6 has been committed as:

https://github.com/virtio-win/kvm-guest-drivers-windows/commit/75d2bd1d35aae39389b05b12191142ae0ed75906

Comment 25 Peixiu Hou 2017-11-20 09:50:39 UTC
Test Used Versions:
kernel-3.10.0-776.el7.x86_64
qemu-kvm-rhev-2.10.0-6.el7.x86_64
seabios-bin-1.11.0-1.el7.noarch
virtio-win-prewhql-143

Host info:
[root@amd-9600b-8-1 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 2
Model name:            AMD Phenom(tm) 9600B Quad-Core Processor
Stepping:              3
CPU MHz:               2300.000
CPU max MHz:           2300.0000
CPU min MHz:           1150.0000
BogoMIPS:              4609.93
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              2048K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc art rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs hw_pstate vmmcall npt lbrv svm_lock

Thanks~

Comment 26 lijin 2017-11-20 09:52:44 UTC
change status to verified according to comment#23 and comment#24

Comment 29 errata-xmlrpc 2018-04-10 06:28:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0657