Bug 1458626

Summary: [virtio-win][netkvm] win2012R2 BSOD after migration during netperf test
Product: Red Hat Enterprise Linux 7 Reporter: lijin <lijin>
Component: virtio-win Assignee: ybendito
virtio-win sub component: virtio-win-prewhql QA Contact: lijin <lijin>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: ailan, jen, jherrman, lijin, mtessun, sjubran, wyu, ybendito
Version: 7.4 Keywords: Regression, ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Prior to this update, after migrating a Windows Server 2012 R2 guest, the guest in certain cases failed to boot and displayed a blue error screen. The bug has been fixed, and the described problem no longer occurs.
Story Points: ---
Clone Of:
: 1473575 (view as bug list) Environment:
Last Closed: 2018-04-10 06:31:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1473046, 1473575    
Attachments:
Description Flags
Fix RC1 for 2012R2 none

Description lijin 2017-06-05 02:48:20 UTC
Description of problem:


Version-Release number of selected component (if applicable):
kernel-3.10.0-675.el7.x86_64
qemu-kvm-rhev-2.9.0-7.el7.x86_64
seabios-1.10.2-2.el7.x86_64
virtio-win-prewhql-139

How reproducible:
60%

Steps to Reproduce:
1. Boot a win2012R2 guest with a netkvm device:
/usr/libexec/qemu-kvm -M pc -cpu host -enable-kvm -m 2G -smp 4,cores=4 -nodefconfig -rtc base=localtime,driftfix=slew -object iothread,id=thread0 -drive file=win2012R2-floppy.qcow2,if=none,serial=virtioblk1,format=qcow2,cache=none,werror=stop,rerror=stop,id=drive-virtio-disk0,aio=native -device ide-drive,drive=drive-virtio-disk0,id=virtio-disk0 -device piix3-usb-uhci,id=usb -device usb-tablet,id=tablet0 -k en-us -qmp tcp:0:4444,server,nowait -boot menu=on -monitor stdio -cdrom en_windows_server_2012_r2_x64_dvd_2707946.iso -fda virtio-win-prewhql-139.vfd -drive file=virtio-win-prewhql-139.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -vnc 0.0.0.0:1 -vga std -netdev tap,id=hostnet1,script=/etc/qemu-ifup,vhost=on,queues=4 -device virtio-net-pci,mq=on,vectors=10,netdev=hostnet1,mac=4e:63:28:bc:b1:01,id=net1 


2. Run the netperf test in the guest:
Windows cmd:
# for %i in (32 64 128 256 512 1024 2048 4096 8192 16384 32768) do for %j in (tcp_stream,udp_stream) do netperf.exe -H 10.66.10.110 -C -c -t %j -f m -l 10s -- -m %i done done

3. Start a second QEMU instance in listening mode, using the same command line with "-incoming tcp::5888" added:
/usr/libexec/qemu-kvm -M pc -cpu host -enable-kvm -m 2G -smp 4,cores=4 -nodefconfig -rtc base=localtime,driftfix=slew -object iothread,id=thread0 -drive file=win2012R2-floppy.qcow2,if=none,serial=virtioblk1,format=qcow2,cache=none,werror=stop,rerror=stop,id=drive-virtio-disk0,aio=native -device ide-drive,drive=drive-virtio-disk0,id=virtio-disk0 -device piix3-usb-uhci,id=usb -device usb-tablet,id=tablet0 -k en-us -qmp tcp:0:4445,server,nowait -boot menu=on -monitor stdio -cdrom en_windows_server_2012_r2_x64_dvd_2707946.iso -fda virtio-win-prewhql-139.vfd -drive file=virtio-win-prewhql-139.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -vnc 0.0.0.0:2 -vga std -netdev tap,id=hostnet1,script=/etc/qemu-ifup,vhost=on,queues=4 -device virtio-net-pci,mq=on,vectors=10,netdev=hostnet1,mac=4e:63:28:bc:b1:01,id=net1 -incoming tcp::5888

4. During step 3, do a local migration:
(qemu)   migrate -d tcp:localhost:5888


Actual results:
After migration, the destination guest hits a BSOD.

Expected results:
The guest works normally, with no BSOD.

Additional info:
1. Can reproduce with build 137.
2. Tried 10 times with virtio-win-1.9.0-3.el7.noarch and could NOT reproduce, so this is a regression.

Comment 4 lijin 2017-06-05 02:55:54 UTC
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: ffffcf8003c3ecd0, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, bitfield :
	bit 0 : value 0 = read operation, 1 = write operation
	bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff801ffef538f, address which referenced memory

Debugging Details:
------------------

*** ERROR: Module load completed but symbols could not be loaded for netkvm.sys

READ_ADDRESS:  ffffcf8003c3ecd0 Special pool

CURRENT_IRQL:  2

FAULTING_IP: 
nt!VfPutScatterGatherList+cb
fffff801`ffef538f 458b37          mov     r14d,dword ptr [r15]

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  AV

PROCESS_NAME:  System

ANALYSIS_VERSION: 6.3.9600.16384 (debuggers(dbg).130821-1623) amd64fre

DPC_STACK_BASE:  FFFFF80200E70FB0

TRAP_FRAME:  fffff80200e69350 -- (.trap 0xfffff80200e69350)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=ffffe00000c5f058 rbx=0000000000000000 rcx=0000000000000002
rdx=0000000000000060 rsi=0000000000000000 rdi=0000000000000000
rip=fffff801ffef538f rsp=fffff80200e694e0 rbp=ffffe00002231950
 r8=0000000000000000  r9=fffff801ffb7a180 r10=0000000000000100
r11=fffff80200e694d8 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
nt!VfPutScatterGatherList+0xcb:
fffff801`ffef538f 458b37          mov     r14d,dword ptr [r15] ds:00000000`00000000=????????
Resetting default scope

LAST_CONTROL_TRANSFER:  from fffff801ff9e4be9 to fffff801ff9d90a0

STACK_TEXT:  
fffff802`00e69208 fffff801`ff9e4be9 : 00000000`0000000a ffffcf80`03c3ecd0 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff802`00e69210 fffff801`ff9e343a : 00000000`00000000 ffffe000`01651f08 00000000`00000000 fffff802`00e69350 : nt!KiBugCheckDispatch+0x69
fffff802`00e69350 fffff801`ffef538f : ffffcf80`02d2cf50 ffffe000`01651ec0 ffffe000`01651ec0 ffffe000`000423f8 : nt!KiPageFault+0x23a
fffff802`00e694e0 fffff800`0069af21 : ffffcf80`02d2cf50 fffff802`00e69639 fffff802`00e69608 fffff801`ff820d60 : nt!VfPutScatterGatherList+0xcb
fffff802`00e69540 fffff800`0219262c : 00000000`00000000 fffff802`00e69608 ffffcf80`03d9ef30 fffff801`ffeff46a : NDIS!NdisMFreeNetBufferSGList+0x31
fffff802`00e69580 fffff800`02192213 : 00000000`00000002 fffff802`00e69639 fffff802`00e69618 00000000`00000000 : netkvm+0x862c
fffff802`00e695b0 fffff800`02192e8f : ffffcf80`036ce560 00000000`00000000 ffffcf80`036ce480 fffff801`ffeff76b : netkvm+0x8213
fffff802`00e695e0 fffff800`0218c3e7 : ffffe000`00000000 00000000`00000000 ffffcf80`00000001 ffffe000`01660000 : netkvm+0x8e8f
fffff802`00e696a0 fffff800`02198988 : ffffe000`01660000 fffff802`00e69829 00000000`00000000 fffff801`ff9de0e1 : netkvm+0x23e7
fffff802`00e69700 fffff800`006935f1 : 00000000`00000000 00001f80`00510074 00000000`00000000 00001f80`00d000d8 : netkvm+0xe988
fffff802`00e69760 fffff801`ff90ce90 : fffff802`00e69b20 ffffffff`ffe6db3c ffffe000`015b24d0 fffff802`00e69a20 : NDIS!ndisInterruptDpc+0x1b2
fffff802`00e69890 fffff801`ff90c111 : fffff801`ff9e5c8f 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiExecuteAllDpcs+0x1b0
fffff802`00e699e0 fffff801`ff9dcbea : fffff801`ffb7a180 fffff801`ffb7a180 fffff801`ffbd2a80 ffffe000`0013c880 : nt!KiRetireDpcList+0xe1
fffff802`00e69c60 00000000`00000000 : fffff802`00e6a000 fffff802`00e64000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a


STACK_COMMAND:  kb

FOLLOWUP_IP: 
netkvm+862c
fffff800`0219262c 4883c428        add     rsp,28h

SYMBOL_STACK_INDEX:  5

SYMBOL_NAME:  netkvm+862c

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: netkvm

IMAGE_NAME:  netkvm.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  59133035

FAILURE_BUCKET_ID:  AV_VRF_netkvm+862c

BUCKET_ID:  AV_VRF_netkvm+862c

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:av_vrf_netkvm+862c

FAILURE_ID_HASH:  {f2a79b88-caf4-0024-7aa2-ea7514bb5c17}

Followup: MachineOwner
---------

Comment 5 Sameeh Jubran 2017-07-09 10:03:48 UTC
Hi Lijin,

I am unable to reproduce this at all (with 139 and 137), can you provide me the dump file?

Thanks

Comment 6 lijin 2017-07-11 02:19:58 UTC
(In reply to Sameeh Jubran from comment #5)
> Hi Lijin,
> 
> I am unable to reproduce this at all (with 139 and 137), can you provide me
> the dump file?
> 
> Thanks

Please find the dump file in comment#3

Comment 7 Ladi Prosek 2017-07-11 13:59:55 UTC
Is it possible that multiple threads race on the CNB::m_SGL field? One thread calls CNB::ReleaseResources, another one CNB::~CNB. Both see non-NULL m_SGL and both call NdisMFreeNetBufferSGList.

Specifically, what if two threads run CParaNdisTX::DoPendingTasks at about the same time and the same CNB is added to completedNBLs by one thread and to nbToFree by the other (under the lock) and then the two run in parallel. Perhaps live migration provides just the right timing conditions (the send fails and the buffer is returned by QEMU immediately?) This code seems to have been reworked in 7.4, which would explain the regression.
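
The double-free pattern hypothesized above can be sketched as follows. This is a minimal illustration, not the netkvm source and not the fix that eventually shipped (see comment 11): `Buffer`, `SgList`, and `free_sg_list` are hypothetical stand-ins for CNB, the NDIS scatter/gather list, and NdisMFreeNetBufferSGList. The sketch assumes that the check of the pointer and its clearing are merged into one atomic exchange, so that only one caller can ever observe a non-null value.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a scatter/gather list; not an NDIS type.
struct SgList { int unused = 0; };

// Counts how often the free routine ran (for demonstration only).
static std::atomic<int> g_free_calls{0};

// Hypothetical stand-in for NdisMFreeNetBufferSGList.
static void free_sg_list(SgList* sgl) {
    assert(sgl != nullptr);
    g_free_calls.fetch_add(1);
    delete sgl;
}

// Hypothetical stand-in for CNB: owns exactly one scatter/gather list.
class Buffer {
public:
    Buffer() : m_sgl(new SgList) {}
    ~Buffer() { release_resources(); }
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;

    // Take the pointer and clear the field in one atomic step.
    // Only the caller that receives a non-null pointer frees it.
    void release_resources() {
        SgList* sgl = m_sgl.exchange(nullptr);
        if (sgl != nullptr) {
            free_sg_list(sgl);
        }
    }

private:
    std::atomic<SgList*> m_sgl;
};

// Releases twice explicitly, then once more from the destructor,
// and returns how many times the list was actually freed.
int frees_after_release_and_destroy() {
    g_free_calls.store(0);
    {
        Buffer nb;
        nb.release_resources();
        nb.release_resources();
    }  // ~Buffer() runs here and finds the pointer already cleared
    return g_free_calls.load();
}
```

With a plain "if (m_SGL) free(m_SGL)" check, two callers could both pass the test before either clears the field. The exchange closes that window for the pointer itself, but it does not make it safe to run the destructor while another thread still uses the object; that still needs separate lifetime management.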

Comment 8 ybendito 2017-07-17 23:51:10 UTC
Created attachment 1300165 [details]
Fix RC1 for 2012R2

Comment 9 ybendito 2017-07-17 23:57:56 UTC
Please try the attached driver (for 2012r2-64); I hope it fixes the problem.
Scratch build for all other OSes: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13672544

If the problem does not reproduce after several attempts, please make more attempts (I would suggest 20); otherwise, please provide a link to the dump file.

Comment 10 lijin 2017-07-18 03:18:41 UTC
Did ping-pong migration 20 times with the attached driver; the win2012R2 guest works normally after migration, with no BSOD.

Comment 11 ybendito 2017-07-18 10:33:16 UTC
According to the dump analysis and further testing of the fix candidate, the BSOD happens when, upon migration, the driver sends an ARP packet while a previous transmission of the very same ARP packet has not yet completed. With the fix, the driver never sends the same packet twice: each ARP transmission uses a fresh clone of the prototype packet.
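
The clone-per-send idea can be sketched roughly as below. This is an illustration under stated assumptions, not the driver code: `Packet`, `ArpAnnouncer`, `send`, and `complete` are hypothetical names, and the real driver works with NDIS net buffer lists rather than byte vectors. The point is only that every transmission owns its own copy of the prototype, so completing one transmission cannot release resources that another one is still using.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical packet: the frame bytes of one transmission.
struct Packet {
    std::vector<std::uint8_t> bytes;
};

// Hypothetical sender that keeps one prototype ARP frame and
// hands out an independent clone for every transmission.
class ArpAnnouncer {
public:
    explicit ArpAnnouncer(std::vector<std::uint8_t> prototype)
        : m_prototype(std::move(prototype)) {}

    // Each send gets its own copy of the prototype, so a send that is
    // still pending never shares state with a newer one.
    Packet* send() {
        auto pkt = std::make_unique<Packet>();
        pkt->bytes = m_prototype;
        m_pending.push_back(std::move(pkt));
        return m_pending.back().get();
    }

    // Completion releases exactly the clone that finished.
    void complete(Packet* done) {
        assert(done != nullptr);
        for (auto it = m_pending.begin(); it != m_pending.end(); ++it) {
            if (it->get() == done) {
                m_pending.erase(it);
                return;
            }
        }
    }

    std::size_t pending() const { return m_pending.size(); }

private:
    std::vector<std::uint8_t> m_prototype;
    std::vector<std::unique_ptr<Packet>> m_pending;
};
```

In this sketch two back-to-back calls to `send()` return two distinct packets, and completing the first leaves the second untouched, which is the property the fix relies on.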

Comment 12 ybendito 2017-07-20 06:19:41 UTC
Fixed in virtio-win-prewhql-0.1-141

Comment 13 lijin 2017-07-20 07:56:53 UTC
Tried with build 141; the guest works well after migration.

So this issue has been fixed.

Comment 19 errata-xmlrpc 2018-04-10 06:31:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0657