Bug 1458626
Summary: [virtio-win][netkvm] win2012R2 BSOD after migration during netperf test
Product: Red Hat Enterprise Linux 7
Component: virtio-win
virtio-win sub component: virtio-win-prewhql
Reporter: lijin <lijin>
Assignee: ybendito
QA Contact: lijin <lijin>
Docs Contact:
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: ailan, jen, jherrman, lijin, mtessun, sjubran, wyu, ybendito
Version: 7.4
Keywords: Regression, ZStream
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text: Prior to this update, after migrating a Windows Server 2012 R2 guest, the guest in certain cases failed to boot and displayed a blue error screen. The bug has been fixed, and the described problem no longer occurs.
Story Points: ---
Clone Of:
: 1473575
Environment:
Last Closed: 2018-04-10 06:31:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1473046, 1473575
Attachments:
Description
lijin
2017-06-05 02:48:20 UTC
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: ffffcf8003c3ecd0, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, bitfield :
    bit 0 : value 0 = read operation, 1 = write operation
    bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff801ffef538f, address which referenced memory

Debugging Details:
------------------

*** ERROR: Module load completed but symbols could not be loaded for netkvm.sys

READ_ADDRESS: ffffcf8003c3ecd0 Special pool

CURRENT_IRQL: 2

FAULTING_IP:
nt!VfPutScatterGatherList+cb
fffff801`ffef538f 458b37 mov r14d,dword ptr [r15]

DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT

BUGCHECK_STR: AV

PROCESS_NAME: System

ANALYSIS_VERSION: 6.3.9600.16384 (debuggers(dbg).130821-1623) amd64fre

DPC_STACK_BASE: FFFFF80200E70FB0

TRAP_FRAME: fffff80200e69350 -- (.trap 0xfffff80200e69350)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=ffffe00000c5f058 rbx=0000000000000000 rcx=0000000000000002
rdx=0000000000000060 rsi=0000000000000000 rdi=0000000000000000
rip=fffff801ffef538f rsp=fffff80200e694e0 rbp=ffffe00002231950
r8=0000000000000000 r9=fffff801ffb7a180 r10=0000000000000100
r11=fffff80200e694d8 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
nt!VfPutScatterGatherList+0xcb:
fffff801`ffef538f 458b37 mov r14d,dword ptr [r15] ds:00000000`00000000=????????
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff801ff9e4be9 to fffff801ff9d90a0

STACK_TEXT:
fffff802`00e69208 fffff801`ff9e4be9 : 00000000`0000000a ffffcf80`03c3ecd0 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff802`00e69210 fffff801`ff9e343a : 00000000`00000000 ffffe000`01651f08 00000000`00000000 fffff802`00e69350 : nt!KiBugCheckDispatch+0x69
fffff802`00e69350 fffff801`ffef538f : ffffcf80`02d2cf50 ffffe000`01651ec0 ffffe000`01651ec0 ffffe000`000423f8 : nt!KiPageFault+0x23a
fffff802`00e694e0 fffff800`0069af21 : ffffcf80`02d2cf50 fffff802`00e69639 fffff802`00e69608 fffff801`ff820d60 : nt!VfPutScatterGatherList+0xcb
fffff802`00e69540 fffff800`0219262c : 00000000`00000000 fffff802`00e69608 ffffcf80`03d9ef30 fffff801`ffeff46a : NDIS!NdisMFreeNetBufferSGList+0x31
fffff802`00e69580 fffff800`02192213 : 00000000`00000002 fffff802`00e69639 fffff802`00e69618 00000000`00000000 : netkvm+0x862c
fffff802`00e695b0 fffff800`02192e8f : ffffcf80`036ce560 00000000`00000000 ffffcf80`036ce480 fffff801`ffeff76b : netkvm+0x8213
fffff802`00e695e0 fffff800`0218c3e7 : ffffe000`00000000 00000000`00000000 ffffcf80`00000001 ffffe000`01660000 : netkvm+0x8e8f
fffff802`00e696a0 fffff800`02198988 : ffffe000`01660000 fffff802`00e69829 00000000`00000000 fffff801`ff9de0e1 : netkvm+0x23e7
fffff802`00e69700 fffff800`006935f1 : 00000000`00000000 00001f80`00510074 00000000`00000000 00001f80`00d000d8 : netkvm+0xe988
fffff802`00e69760 fffff801`ff90ce90 : fffff802`00e69b20 ffffffff`ffe6db3c ffffe000`015b24d0 fffff802`00e69a20 : NDIS!ndisInterruptDpc+0x1b2
fffff802`00e69890 fffff801`ff90c111 : fffff801`ff9e5c8f 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiExecuteAllDpcs+0x1b0
fffff802`00e699e0 fffff801`ff9dcbea : fffff801`ffb7a180 fffff801`ffb7a180 fffff801`ffbd2a80 ffffe000`0013c880 : nt!KiRetireDpcList+0xe1
fffff802`00e69c60 00000000`00000000 : fffff802`00e6a000 fffff802`00e64000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a

STACK_COMMAND: kb

FOLLOWUP_IP:
netkvm+862c
fffff800`0219262c 4883c428 add rsp,28h

SYMBOL_STACK_INDEX: 5

SYMBOL_NAME: netkvm+862c

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: netkvm

IMAGE_NAME: netkvm.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 59133035

FAILURE_BUCKET_ID: AV_VRF_netkvm+862c

BUCKET_ID: AV_VRF_netkvm+862c

ANALYSIS_SOURCE: KM

FAILURE_ID_HASH_STRING: km:av_vrf_netkvm+862c

FAILURE_ID_HASH: {f2a79b88-caf4-0024-7aa2-ea7514bb5c17}

Followup: MachineOwner
---------

Hi Lijin,

I am unable to reproduce this at all (with 139 and 137), can you provide me the dump file?

Thanks

(In reply to Sameeh Jubran from comment #5)
> Hi Lijin,
>
> I am unable to reproduce this at all (with 139 and 137), can you provide me
> the dump file?
>
> Thanks

Please find the dump file in comment#3

Is it possible that multiple threads race on the CNB::m_SGL field? One thread calls CNB::ReleaseResources, another one CNB::~CNB. Both see non-NULL m_SGL and both call NdisMFreeNetBufferSGList.

Specifically, what if two threads run CParaNdisTX::DoPendingTasks at about the same time and the same CNB is added to completedNBLs by one thread and to nbToFree by the other (under the lock) and then the two run in parallel.

Perhaps live migration provides just the right timing conditions (the send fails and the buffer is returned by QEMU immediately?)

This code seems to have been reworked in 7.4, which would explain the regression.

Created attachment 1300165 [details]
Fix RC1 for 2012R2
Please try the attached driver (for 2012r2-64); I hope it fixes the problem.
Scratch build for all other OSes:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13672544
If the problem does not reproduce after several attempts, please try more (I would suggest 20); otherwise please provide a link to the dump file.

Did ping-pong migration 20 times with the attached driver; the win2012R2 guest works normally after migration, no BSOD.

According to the dump analysis and further testing of the fix candidate, the BSOD happens when, upon migration, the driver sends an ARP packet while transmission of the previous, identical ARP packet has not yet completed. In the fix we never send the same packet twice: each ARP send uses a fresh clone of a prototype packet.

Fixed in virtio-win-prewhql-0.1-141

Tried with build 141; the guest works well after migration. So this issue has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0657
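
For readers following the root-cause comment above, here is a minimal sketch of the "clone a prototype for every send" idea in plain C++. It is not the netkvm code: `Packet`, `ArpAnnouncer` and `MakePacketForSend` are hypothetical names invented for illustration, and a real NDIS driver would work with NET_BUFFER_LIST objects rather than a byte vector. The sketch only shows why giving each send its own object avoids releasing the same resources twice when an earlier send is still in flight.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a packet handed to the transmit path.
struct Packet {
    std::vector<std::uint8_t> bytes;
};

// Hypothetical helper: keeps one prototype ARP frame and never sends it
// directly. Every send gets an independent copy with its own lifetime.
class ArpAnnouncer {
public:
    explicit ArpAnnouncer(std::vector<std::uint8_t> prototype)
        : prototype_{std::move(prototype)} {}

    // Returns a fresh clone of the prototype. The caller (the send path in
    // this sketch) owns the clone exclusively and frees it on completion,
    // so two overlapping sends can never free the same object.
    std::unique_ptr<Packet> MakePacketForSend() const {
        return std::make_unique<Packet>(Packet{prototype_.bytes});
    }

private:
    Packet prototype_;  // template only, never queued for transmission
};
```

With this shape, completing one send releases only that send's clone, regardless of how many earlier sends of the same ARP content are still pending.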