Bug 1599631

Summary: [virtio-win][netkvm][whql] Job "NDISTest 6.0 - [2 Machine] - 2c_Mini6RSSSendRecv (Multi-Group Win8+)" BSOD with build154/156
Product: Red Hat Enterprise Linux 7 Reporter: Yu Wang <wyu>
Component: virtio-win Assignee: Sameeh Jubran <sjubran>
virtio-win sub component: virtio-win-prewhql QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: ailan, ddepaula, lijin, michen, phou, sjubran, vrozenfe, xiagao, yvugenfi
Version: 7.6 Keywords: Regression
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
NO_DOCS
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-30 16:21:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Win8/10 builds without the event suppression feature. none

Description Yu Wang 2018-07-10 08:58:57 UTC
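Description of problem:
The WHQL job "NDISTest 6.0 - [2 Machine] - 2c_Mini6RSSSendRecv (Multi-Group Win8+)" ends in a BSOD with virtio-win-prewhql builds 154 and 156.
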
Version-Release number of selected component (if applicable):
kernel-3.10.0-919.el7.x86_64
virtio-win-prewhql-154/156
qemu-kvm-rhev-2.12.0-7.el7.x86_64
seabios-bin-1.11.0-2.el7.noarch

How reproducible:
2/2

Steps to Reproduce:
1. Boot the guest with a virtio-net device.
2. Submit the job.

Actual results:
The job fails with a BSOD.

Expected results:
Pass

Additional info:
1. The job passes with the RHEL 7.5 release build (build 144), so this is a regression.
2. The job fails on both RHEL 7 and RHEL 8 hosts.

Comment 2 Yu Wang 2018-07-10 09:34:10 UTC
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common bugcheck.  Usually the exception address pinpoints
the driver/function that caused the problem.  Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffff80000003, The exception code that was not handled
Arg2: fffff800cc4e3d29, The address that the exception occurred at
Arg3: ffffd000b87b10d8, Exception Record Address
Arg4: ffffd000b87b08e0, Context Record Address

Debugging Details:
------------------


EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid

FAULTING_IP: 
NDProt630+a5d29
fffff800`cc4e3d29 cc              int     3

EXCEPTION_RECORD:  ffffd000b87b10d8 -- (.exr 0xffffd000b87b10d8)
ExceptionAddress: fffff800cc4e3d29 (NDProt630+0x00000000000a5d29)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 1
   Parameter[0]: 0000000000000000

CONTEXT:  ffffd000b87b08e0 -- (.cxr 0xffffd000b87b08e0;r)
rax=0000000000000000 rbx=ffffe0018e6c8080 rcx=a42658efbf840000
rdx=0000000000000000 rsi=ffffe0018e6c8080 rdi=ffffe0018c85d580
rip=fffff800cc4e3d29 rsp=ffffd000b87b1310 rbp=0000000000000080
 r8=0000000000000000  r9=ffffd000b87b0d00 r10=00000000fffffffd
r11=0000000000000000 r12=0000000000000000 r13=fffff802dd81e000
r14=ffffe0018ebafad8 r15=fffff800cc50a0a0
iopl=0         nv up ei ng nz na pe nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000282
NDProt630+0xa5d29:
fffff800`cc4e3d29 cc              int     3
Last set context:
rax=0000000000000000 rbx=ffffe0018e6c8080 rcx=a42658efbf840000
rdx=0000000000000000 rsi=ffffe0018e6c8080 rdi=ffffe0018c85d580
rip=fffff800cc4e3d29 rsp=ffffd000b87b1310 rbp=0000000000000080
 r8=0000000000000000  r9=ffffd000b87b0d00 r10=00000000fffffffd
r11=0000000000000000 r12=0000000000000000 r13=fffff802dd81e000
r14=ffffe0018ebafad8 r15=fffff800cc50a0a0
iopl=0         nv up ei ng nz na pe nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000282
NDProt630+0xa5d29:
fffff800`cc4e3d29 cc              int     3
Resetting default scope

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  AV

PROCESS_NAME:  System

CURRENT_IRQL:  0

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_PARAMETER1:  0000000000000000

ANALYSIS_VERSION: 6.3.9600.16520 (debuggers(dbg).140127-0329) amd64fre

LAST_CONTROL_TRANSFER:  from fffff800cc4ea600 to fffff800cc4e3d29

STACK_TEXT:  
ffffd000`b87b1310 fffff800`cc4ea600 : fffff800`cc5be930 ffffe001`00000380 fffff800`cc5bf4f0 00000000`00003c00 : NDProt630+0xa5d29
ffffd000`b87b1350 fffff800`cc50a0f3 : ffffe001`8ebafaa8 00000000`00000001 00000000`00000000 00000000`00006a12 : NDProt630+0xac600
ffffd000`b87b1430 fffff802`dd91fc70 : ffffe001`8ebafad8 fffff960`000dfeed fffff901`42289e80 fffff960`000eabb1 : NDProt630+0xcc0f3
ffffd000`b87b1480 fffff802`dd974fc6 : fffff802`ddb21180 ffffe001`8e6c8080 fffff802`ddb7aa00 fffff802`dd882cb2 : nt!PspSystemThreadStartup+0x58
ffffd000`b87b14e0 00000000`00000000 : ffffd000`b87b2000 ffffd000`b87ab000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16


FOLLOWUP_IP: 
NDProt630+a5d29
fffff800`cc4e3d29 cc              int     3

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  NDProt630+a5d29

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: NDProt630

IMAGE_NAME:  NDProt630.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  550cea5c

STACK_COMMAND:  .cxr 0xffffd000b87b08e0 ; kb

FAILURE_BUCKET_ID:  AV_VRF_NDProt630+a5d29

BUCKET_ID:  AV_VRF_NDProt630+a5d29

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:av_vrf_ndprot630+a5d29

FAILURE_ID_HASH:  {a0550f62-2bda-baa3-4f2b-7854cdb7064d}

Followup: MachineOwner
---------
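
As a reading aid for the dump above, here is a minimal Python sketch that decodes the two values the analysis hinges on. It assumes that Arg1 is the 32-bit NTSTATUS sign-extended to 64 bits and that the `NDProt630+<offset>` values are relative to the image base; the helper names are hypothetical and the constants are copied from the dump.

```python
# Values copied from the bugcheck analysis above.
ARG1 = 0xFFFFFFFF80000003              # Arg1: exception code, sign-extended to 64 bits
EXCEPTION_ADDRESS = 0xFFFFF800CC4E3D29  # Arg2 / ExceptionAddress
FAULT_OFFSET = 0xA5D29                  # from "NDProt630+a5d29"

STATUS_BREAKPOINT = 0x80000003          # NTSTATUS reported in ERROR_CODE (int 3)


def ntstatus_from_bugcheck_arg(arg: int) -> int:
    """Keep only the low 32 bits, which hold the NTSTATUS value."""
    return arg & 0xFFFFFFFF


def module_base(address: int, offset: int) -> int:
    """Infer the image base from an address and its module-relative offset."""
    return address - offset


status = ntstatus_from_bugcheck_arg(ARG1)
base = module_base(EXCEPTION_ADDRESS, FAULT_OFFSET)

# Return addresses and offsets of the other NDProt630 frames in STACK_TEXT.
other_frames = {
    0xFFFFF800CC4EA600: 0xAC600,
    0xFFFFF800CC50A0F3: 0xCC0F3,
}

print(f"exception code: {status:#010x}")
print(f"inferred NDProt630 base: {base:#018x}")
for address, offset in other_frames.items():
    print(f"{address:#018x} -> NDProt630+{address - base:#x} (dump says +{offset:#x})")
```

The decoded code matches the breakpoint status that the dump reports in ERROR_CODE, and all three NDProt630 frames resolve to the same inferred base, so the break instruction was executed inside the test protocol driver (NDProt630.sys) rather than at an address inside netkvm.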

Comment 7 Sameeh Jubran 2018-07-17 12:40:46 UTC
From the investigation I've made so far, it seems like the device is not notifying the driver that it has finished sending packets.
Can you reproduce with vhost = off?
Can you reproduce with qemu 2.9 for example?

Comment 8 Sameeh Jubran 2018-07-17 13:08:25 UTC
(In reply to Sameeh Jubran from comment #7)
> From the investigation I've made so far, it seems like the device is not
> notifying the driver that it has finished sending packets.
> Can you reproduce with vhost = off?
> Can you reproduce with qemu 2.9 for example?

More questions:

* Did you try build 144 on this same setup? (7.6) Does it pass?
* Did you try the build 156 on the previous setup?

Comment 9 Yu Wang 2018-07-18 06:01:53 UTC
(In reply to Sameeh Jubran from comment #8)
> (In reply to Sameeh Jubran from comment #7)
> > From the investigation I've made so far, it seems like the device is not
> > notifying the driver that it has finished sending packets.
> > Can you reproduce with vhost = off?
> > Can you reproduce with qemu 2.9 for example?

I will try it and report the result. In the meantime, please see my answers below.

> 
> More questions:
> 
> * Did you try build 144 on this same setup? (7.6) Does it pass?
> * Did you try the build 156 on the previous setup?

As I said in comment #0, the job passes with the RHEL 7.5 release build (build 144), so it is a regression.
The setup is the same (vhost=on, qemu-kvm-rhev-2.12.0-7.el7.x86_64).

Thanks
Yu Wang

Comment 10 Yu Wang 2018-07-18 09:59:44 UTC
(In reply to Sameeh Jubran from comment #8)

> > Can you reproduce with vhost = off?

The job passes with vhost=off.

Thanks
Yu Wang

Comment 11 Sameeh Jubran 2018-07-19 15:54:31 UTC
Created attachment 1460861 [details]
Win8/10 builds without the event suppression feature.

I have created a build with one feature of the virtio queue (event suppression) disabled; this might resolve the issue. Can you please test whether the BSOD still reproduces with this build and vhost=on?

Thanks!
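
For readers unfamiliar with the feature, here is a minimal Python sketch of the notification check that event suppression adds. It assumes that "event suppression" here means the VIRTIO_RING_F_EVENT_IDX feature from the virtio specification; the thread does not name the feature bit, and this is not the netkvm or vhost code. The function mirrors the `vring_need_event` helper from the specification, ported to Python with explicit 16-bit wraparound.

```python
MASK16 = 0xFFFF  # ring indices are free-running 16-bit counters


def vring_need_event(event_idx: int, new_idx: int, old_idx: int) -> bool:
    """Return True if moving the index from old_idx to new_idx crosses event_idx.

    With the event index feature negotiated, the device interrupts the driver
    only when this returns True for the driver's published used_event value.
    """
    return ((new_idx - event_idx - 1) & MASK16) < ((new_idx - old_idx) & MASK16)


# The device moved its used index from 10 to 12 (two buffers completed).
print(vring_need_event(event_idx=10, new_idx=12, old_idx=10))  # driver asked for 10: notify
print(vring_need_event(event_idx=12, new_idx=12, old_idx=10))  # driver asked for 12: stay silent
```

If the driver publishes an event index that the device never crosses, the device legitimately stays silent, which would look like the missing "finished sending" notification described in comment #7. Whether that is what happens here is an assumption until the offending commit is identified.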

Comment 12 Yu Wang 2018-07-20 05:34:05 UTC
(In reply to Sameeh Jubran from comment #11)
> Created attachment 1460861 [details]
> Win8/10 builds without the event suppression feature.
> 
> I have created a build with a disabled feature of the virtio queue, this
> might resolve the issue... can you please test if the BSOD reproduces with
> this build and vhost=on.

The job passes without a BSOD using your temp driver build.

Thanks
Yu Wang


> 
> Thanks!

Comment 13 Sameeh Jubran 2018-07-24 00:22:17 UTC
(In reply to Yu Wang from comment #12)

Does the temp build still pass with vhost=on and a single virtqueue? If not, does build 144 pass in that configuration?

Comment 14 Sameeh Jubran 2018-07-24 16:26:35 UTC
(In reply to Sameeh Jubran from comment #13)

Can you please run all the other tests with the temp build? I can't test this on my setup because it always tends to fail with a BSOD there, which may be caused by the newer kernel I am using.

Comment 15 Yu Wang 2018-07-25 07:32:51 UTC
Hi, 

>Can you please test the temp build on all other tests, since i can't test this on my setup as it tends to always fail with BSOD, it may be caused by the newer kernel I am using.

I will test this later.

I recently ran this case with build 157, and it passed without a BSOD,
but qemu printed "qemu-kvm: unable to start vhost net: 14: falling back on userspace virtio". There seems to be a bug when setting vhost=on, which I reported as:

Bug 1608226 - [virtual-network] prompt warning "qemu-kvm: unable to start vhost net: 14: falling back on userspace virtio" when boot with win8+ guests 


Thanks
Yu Wang

Comment 16 Yu Wang 2018-07-25 09:56:47 UTC
Summary :

When booting the guest with a single queue and vhost=on: a BSOD occurs (tried on build 156).

When booting with mq and vhost=on: the error "qemu-kvm: unable to start vhost net: 14: falling back on userspace virtio" appears, which seems to be a bug in setting vhost=on, but the job PASSES (tried on build 157).

With the tmp build, the job passes with vhost=off.

Thanks
Yu Wang
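
The configurations in this summary differ only in the network options passed to qemu-kvm. The exact command line used for these runs is not recorded in this bug, so the following Python sketch only illustrates how the option pairs could be generated. The netdev id `hostnet0` and the queue count of 4 are hypothetical, and `vectors=2*N+2` follows the usual multiqueue virtio-net guidance rather than anything stated here.

```python
from typing import List


def virtio_net_args(vhost: bool, queues: int = 1, netdev_id: str = "hostnet0") -> List[str]:
    """Build the -netdev/-device option pair for a tap-backed virtio-net NIC."""
    netdev = f"tap,id={netdev_id},vhost={'on' if vhost else 'off'}"
    device = f"virtio-net-pci,netdev={netdev_id}"
    if queues > 1:
        # Multiqueue: one RX/TX pair per queue, plus config and control vectors.
        netdev += f",queues={queues}"
        device += f",mq=on,vectors={2 * queues + 2}"
    return ["-netdev", netdev, "-device", device]


# The three setups discussed in this comment (queue count of 4 is an assumption).
for label, args in [
    ("single queue, vhost=on", virtio_net_args(vhost=True)),
    ("mq, vhost=on", virtio_net_args(vhost=True, queues=4)),
    ("single queue, vhost=off", virtio_net_args(vhost=False)),
]:
    print(f"{label}: {' '.join(args)}")
```

Generating the options from a single helper keeps the vhost and queue settings as the only variables between runs, which makes results from different builds easier to compare.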

Comment 17 Yu Wang 2018-07-25 10:07:52 UTC
(In reply to Sameeh Jubran from comment #14)

> 
> Can you please test the temp build on all other tests, since i can't test
> this on my setup as it tends to always fail with BSOD, it may be caused by
> the newer kernel I am using.

Should I run all tests with multi-queue or single queue?


Thanks
Yu Wang

Comment 18 Sameeh Jubran 2018-07-25 10:59:07 UTC
(In reply to Yu Wang from comment #17)
> (In reply to Sameeh Jubran from comment #14)
> 
> > 
> > Can you please test the temp build on all other tests, since i can't test
> > this on my setup as it tends to always fail with BSOD, it may be caused by
> > the newer kernel I am using.
> 
> run all tests with multi-queue or single queue?
Multiqueue please
> 
> 
> Thanks
> Yu Wang

Comment 20 Sameeh Jubran 2018-07-25 22:54:28 UTC
For Win10 we have an errata
https://bugzilla.redhat.com/show_bug.cgi?id=1367251#c11

and for the test itself to pass the following should be done:

Mini6RSSSendRecv (Multi-Group Win8+) test
Right after the initial reboot on test initiation (Before the test itself starts!), enter the command prompt as the Administrator, and type:

bcdedit.exe /set groupaware off
bcdedit.exe /deletevalue groupsize
shutdown /r /t 0 /f

Comment 22 Sameeh Jubran 2018-07-27 13:23:07 UTC
(In reply to Sameeh Jubran from comment #20)

Thanks to Yu's help in reproducing the issue and testing possible fixes, I have identified the offending commit and opened a pull request:
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/317

The commit should make it into the next build; I have already asked Vadim to add it.

Comment 23 Yu Wang 2018-07-31 09:44:08 UTC
Ran this job with build 159:

1. With 1 queue and vhost=on:
Passed on the first run.

2. With mq and vhost=on:
Passed on the second run; the first run hit a BSOD (7e, IMAGE_NAME: NDProt630.sys, same as comment #2).

Can this be counted as fixed?

Thanks
Yu Wang

Comment 24 Sameeh Jubran 2018-08-05 09:25:16 UTC
(In reply to Yu Wang from comment #23)

Can you please supply me with the BSOD?

And yes, let's count this as fixed for now, as we have already identified the offending commit. The BSOD on the first mq run might be a different issue.

Comment 28 lijin 2018-09-17 02:21:26 UTC
Hi Danilo,

This bug also needs to be added to the RHEL 7.6 virtio-win errata. Could you help with that?

Thanks a lot

Comment 29 Danilo de Paula 2018-09-19 12:04:37 UTC
It's already there.

Comment 31 errata-xmlrpc 2018-10-30 16:21:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3413