Bug 1443019 - [virtio-win][qemupciserial] job "PNP Rebalance Request New Resources Device Test" and two other jobs failed on win10+ guests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: virtio-win
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ladi Prosek
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-04-18 10:09 UTC by Yu Wang
Modified: 2017-08-04 04:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 12:58:08 UTC
Target Upstream Version:
Embargoed:


Attachments
133-pass (6.66 MB, application/zip)
2017-04-18 10:09 UTC, Yu Wang
135-fail (6.40 MB, application/zip)
2017-04-18 10:12 UTC, Yu Wang
com-error (170.76 KB, image/png)
2017-04-19 05:56 UTC, Yu Wang


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2341 0 normal SHIPPED_LIVE virtio-win bug fix and enhancement update 2017-08-01 16:52:38 UTC

Description Yu Wang 2017-04-18 10:09:28 UTC
Created attachment 1272265 [details]
133-pass

Description of problem:
The job "PNP Rebalance Request New Resources Device Test" and two other jobs failed on win10+ guests.

Version-Release number of selected component (if applicable):
virtio-win-prewhql-133/135
qemu-kvm-rhev-2.8.0-5/6.el7.x86_64
kernel-3.10.0-574/634.el7.x86_64
seabios-1.10.1-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest with qemupciserial:
/usr/libexec/qemu-kvm -name 133QSRW10S64TGQ -enable-kvm -m 6G -smp 8 -uuid f21be03e-dc00-4892-acab-1e88c3303ad6 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/tmp/133QSRW10S64TGQ,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -boot order=cd,menu=on -device piix3-usb-uhci,id=usb -drive file=133QSRW10S64TGQ,if=none,id=drive-ide0-0-0,format=raw,serial=mike_cao,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=en_windows_server_2016_x64_dvd_9327751.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=133QSRW10S64TGQ.vfd,if=floppy,id=drive-fdc0-0-0,format=raw,cache=none -netdev tap,script=/etc/qemu-ifup,downscript=no,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=00:52:6a:42:d4:3e -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=isa_serial0 -device usb-tablet,id=input0 -vnc 0.0.0.0:1 -vga std -M pc -chardev socket,path=/tmp/133QSRW10S64TGQ_serial0,server,nowait,id=serial0 -device pci-serial,chardev=serial0,id=pciserial0

2. Submit the job.

Actual results:
The jobs still fail, even with the filter applied.

Expected results:
The jobs should pass, either outright or with the filter.


Additional info:
1. The jobs used to pass with the filter, but now they fail even with the earlier versions (listed below):
virtio-win-prewhql-133
qemu-kvm-rhev-2.8.0-5.el7.x86_64
kernel-3.10.0-574.el7.x86_64
seabios-1.10.1-2.el7.x86_64

2. Tried on 3 HLK servers; the jobs still fail.

3. Logs for build 133 (pass) and build 135 (fail) are attached. Both show the same error, but it can no longer be filtered.

4. The following jobs failed:
  DF - PNP Stop (Rebalance) Device Test (Reliability) failed on win10-32/64/ws2016
  DF - PNP Rebalance Request New Resources Device Test (Reliability) failed on win10-32/64/ws2016
  DF - PNP Remove Device Test (Reliability) only failed on win2016

Comment 2 Yu Wang 2017-04-18 10:12:19 UTC
Created attachment 1272267 [details]
135-fail

The error is the same as with build 133, but this time it could not pass with the filter. Build 133 now fails as well.

WDTF_SIMPLE_IO : - Open(Communications Port (COM3) MF\PCI#VEN_1B36&DEV_0002&SUBSYS_11001AF4&REV_01\4&3227F39&0&28#CHILD0000 ) Failed : Device is reporting a problem code (Status Flags=0x1806400 (DN_HAS_PROBLEM DN_DISABLEABLE DN_REMOVABLE DN_NT_ENUMERATOR DN_NT_DRIVER) Problem Code=a (CM_PROB_FAILED_START)) HRESULT=0x80004005

WDTF_SIMPLE_IO : Device Status: Status Flags=0x1806400 (DN_HAS_PROBLEM DN_DISABLEABLE DN_REMOVABLE DN_NT_ENUMERATOR DN_NT_DRIVER) Problem Code=a (CM_PROB_FAILED_START)
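
For reference, the Status Flags value in the log can be decoded by hand. The following is a minimal Python sketch, assuming the DN_* bit values from the Windows cfg.h header; only the five bits that appear in the log are listed, and the helper name decode_status is hypothetical.

```python
# Subset of the DN_* device-node status bits (values as defined in cfg.h).
# Only the bits that appear in the WDTF log above are listed here.
DN_FLAGS = {
    0x00000400: "DN_HAS_PROBLEM",
    0x00002000: "DN_DISABLEABLE",
    0x00004000: "DN_REMOVABLE",
    0x00800000: "DN_NT_ENUMERATOR",
    0x01000000: "DN_NT_DRIVER",
}

# Problem code 0xa from the log; Device Manager shows it as "Code 10".
CM_PROB_FAILED_START = 0x0A


def decode_status(flags):
    """Return (names of known bits that are set, leftover unknown bits)."""
    names = [name for bit, name in sorted(DN_FLAGS.items()) if flags & bit]
    known = sum(bit for bit in DN_FLAGS if flags & bit)
    return names, flags & ~known


names, unknown = decode_status(0x1806400)
print(" ".join(names), hex(unknown))
```

With the value from the log, every set bit is accounted for by the five names WDTF printed, so nothing else is hiding in the status word.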

Comment 3 Yu Wang 2017-04-19 05:56:21 UTC
Created attachment 1272479 [details]
com-error

When running the jobs from comment 0, the device reported an error (Code 10), as shown in the screenshot:
Insufficient system resources exist to complete the API

Thanks
Yu Wang

Comment 5 lijin 2017-05-02 06:06:37 UTC
Any update about this bug?

Comment 6 Ladi Prosek 2017-05-02 14:52:01 UTC
I was able to reproduce this and have a theory on what might be wrong. Gal, feel free to re-assign to me.

Comment 8 Ladi Prosek 2017-05-03 12:41:26 UTC
Debugging notes:
I can reproduce this consistently by running the test named "DF - PNP Rebalance Request New Resources Device Test (Reliability)". In the middle of the test, after resources are rebalanced, the COMn device gets the yellow exclamation point and reports Code 10, insufficient resources.

Running "info pci" in HMP, I see the IRQ number changing and the I/O BAR staying the same. Originally I thought that the problem was in QEMU not catching the IRQ update. Unlike the multi-port PCI serial device, the single-port one doesn't use pci_set_irq to set the interrupt so it doesn't (AFAICT) re-read the relevant part of the PCI config space. Sadly, fixing that didn't help.

Fortunately serial.sys shipped with Windows was compiled with some debugging info so I could do:

  0: kd> bp serial!SerialDbgPrintEx "da rdx; g"

to get crude debug print outs and see the code flow. The second invocation of serial!SerialFinishStartDevice didn't seem to run to completion and sure enough, tracing the function confirmed that serial!SerialGetPortInfo called at this callstack:

  serial!SerialFinishStartDevice+0x2b0
  serial!SerialStartDevice+0xcf
  serial!SerialPnpDispatch+0x390
  nt!IovCallDriver+0x252

returned c000009a (STATUS_INSUFFICIENT_RESOURCES).
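
As a side note, the returned value can be split into the standard NTSTATUS fields. This is a small Python sketch of the documented bit layout (severity in bits 31-30, customer bit 29, facility in bits 27-16, code in bits 15-0); the function name decode_ntstatus is made up for illustration.

```python
def decode_ntstatus(value):
    """Split a 32-bit NTSTATUS value into its documented bit fields."""
    return {
        "severity": (value >> 30) & 0x3,    # 0=success, 1=info, 2=warning, 3=error
        "customer": (value >> 29) & 0x1,    # 0 for Microsoft-defined codes
        "facility": (value >> 16) & 0xFFF,  # 0 is the default facility
        "code": value & 0xFFFF,
    }


STATUS_INSUFFICIENT_RESOURCES = 0xC000009A
fields = decode_ntstatus(STATUS_INSUFFICIENT_RESOURCES)
print(fields)
```

The severity field is 3 (error) and the facility is 0, so this is a plain kernel error code rather than something specific to the serial stack.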

Comment 9 Ladi Prosek 2017-05-03 14:43:14 UTC
STATUS_INSUFFICIENT_RESOURCES is caused by the IRP not having the expected non-NULL parameters. Specifically for IRP_MN_START_DEVICE, arg1 and arg2 of the current IO stack location should contain AllocatedResources and AllocatedResourcesTranslated but are both NULL.

Let's see what kind of PnP IRPs the driver is getting:

  0: kd> bp serial!SerialPnpDispatch "!irp @rdx; g"

...

First IRP_MN_START_DEVICE (all is good):

Irp is active with 3 stacks 2 is current (= 0xffffd8843744cf28)
 No Mdl: No System Buffer: Thread ffff9d88eb634040:  Irp stack trace.  
     cmd  flg cl Device   File     Completion-Context
 [N/A(0), N/A(0)]
            0 10 00000000 00000000 00000000-00000000    

			Args: 00000000 00000000 00000000 00000000
>[IRP_MJ_PNP(1b), IRP_MN_START_DEVICE(0)]
            0 e0 ffff9d88eb6f3070 00000000 fffff808443c1200-ffffc800a30c8628 Success Error Cancel 
	       \Driver\Serial	serenum!SerenumSyncCompletion
			Args: ffff88046d5c3690 ffff88046d6f8db0 00000000 00000000
 [IRP_MJ_PNP(1b), IRP_MN_START_DEVICE(0)]
            0 e0 ffff9d88eb6f3d60 00000000 fffff80200f73088-ffff9d88ebc27400 Success Error Cancel 
	       \Driver\Serenum	nt!PnpDeviceCompletionRoutine
			Args: ffff88046d5c3690 ffff88046d6f8db0 00000000 00000000

...

Second IRP_MN_START_DEVICE (device fails to start):

Irp is active with 3 stacks 2 is current (= 0xffffd884373c6f28)
 No Mdl: No System Buffer: Thread ffff9d88ebef2040:  Irp stack trace.  
     cmd  flg cl Device   File     Completion-Context
 [N/A(0), N/A(0)]
            0 10 00000000 00000000 00000000-00000000    

			Args: 00000000 00000000 00000000 00000000
>[IRP_MJ_PNP(1b), IRP_MN_START_DEVICE(0)]
            0 e0 ffff9d88eb6f3070 00000000 fffff808443c1200-ffffc800a3662628 Success Error Cancel 
	       \Driver\Serial	serenum!SerenumSyncCompletion
			Args: 00000000 00000000 00000000 00000000
 [IRP_MJ_PNP(1b), IRP_MN_START_DEVICE(0)]
            0 e0 ffff9d88eb6f3d60 00000000 fffff80200f73088-ffff9d88e9e95880 Success Error Cancel 
	       \Driver\Serenum	nt!PnpDeviceCompletionRoutine
			Args: 00000000 00000000 00000000 00000000


They look the same except for the missing Args.
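
The comparison can also be done mechanically on saved debugger output. Below is a rough Python sketch, assuming the "!irp" text layout shown above (the current stack location is the line starting with ">", followed by an "Args:" line); the helper names and the trimmed sample strings are illustrative only.

```python
import re


def current_args(irp_text):
    """Return the four Args values of the current (">"-marked) stack location."""
    lines = irp_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">"):
            # The Args line follows the header and device/completion lines.
            for later in lines[i + 1:]:
                m = re.match(r"\s*Args:\s+(.*)", later)
                if m:
                    return [int(tok, 16) for tok in m.group(1).split()]
    return None


def start_has_resources(irp_text):
    """True if arg1 and arg2 (the two resource list pointers) are non-NULL."""
    args = current_args(irp_text)
    return args is not None and args[0] != 0 and args[1] != 0


# Trimmed copies of the two dumps above (only the lines the parser needs).
FIRST_START = """\
 [N/A(0), N/A(0)]
            Args: 00000000 00000000 00000000 00000000
>[IRP_MJ_PNP(1b), IRP_MN_START_DEVICE(0)]
            Args: ffff88046d5c3690 ffff88046d6f8db0 00000000 00000000
"""
SECOND_START = """\
 [N/A(0), N/A(0)]
            Args: 00000000 00000000 00000000 00000000
>[IRP_MJ_PNP(1b), IRP_MN_START_DEVICE(0)]
            Args: 00000000 00000000 00000000 00000000
"""

print(start_has_resources(FIRST_START), start_has_resources(SECOND_START))
```

Only the current stack location is inspected because that is the one serial.sys reads its parameters from.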

Comment 10 Ladi Prosek 2017-05-03 15:04:50 UTC
Posted a question on the ntdev list:
http://www.osronline.com/showthread.cfm?link=283584

Comment 11 Ladi Prosek 2017-05-04 09:43:39 UTC
I have tried assigning resources to the port with a simple:

  HKR,Child0000,ResourceMap,1,00,01,02

instead of:

  HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00
  HKR,Child0000,ResourceMap,1,02

but it didn't help.
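
To make the two INF variants easier to compare, here is a Python sketch that decodes the binary values. It assumes the entry layout commonly described for MF.sys: each VaryingResourceMap entry is a 1-byte parent resource index followed by a 4-byte little-endian offset and a 4-byte little-endian length, and ResourceMap is a plain list of 1-byte parent resource indexes. That layout and the function names are assumptions, not something confirmed in this bug.

```python
import struct

# Assumed entry layout: parent resource index (1 byte),
# offset (4 bytes, little-endian), length (4 bytes, little-endian).
ENTRY = struct.Struct("<BII")


def parse_varying_resource_map(data):
    """Decode a VaryingResourceMap value into (index, offset, length) tuples."""
    if len(data) % ENTRY.size:
        raise ValueError("length is not a multiple of the 9-byte entry size")
    return [ENTRY.unpack_from(data, pos) for pos in range(0, len(data), ENTRY.size)]


def parse_resource_map(data):
    """Decode a ResourceMap value into a list of parent resource indexes."""
    return list(data)


# Bytes copied from the INF lines above.
varying = parse_varying_resource_map(bytes.fromhex("00 00 00 00 00 08 00 00 00"))
original = parse_resource_map(bytes.fromhex("02"))
simple = parse_resource_map(bytes.fromhex("00 01 02"))
print(varying, original, simple)
```

Under that assumption, the original map hands the child an 8-byte window at offset 0 of parent resource 0 plus all of resource 2, while the simplified variant passes parent resources 0, 1 and 2 through whole.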

Comment 12 Ladi Prosek 2017-05-05 13:37:45 UTC
This is very likely a Windows bug.

Here's a brief overview of the architecture of the qemupciserial driver:

We ship qemupciserial.inf which references the in-box Windows MF (multi-function) driver and provides a recipe for splitting resources among individual UARTs. This is what the VaryingResourceMap and ResourceMap entries are for. They are read by MF.sys which acts like a bus driver and enumerates a PNP0501 device for each UART, then driven by serial.sys. There is a Windows-internal and undocumented interface between MF.sys and serial.sys to communicate this resource allocation.

In order for these HLK tests to pass, MF.sys must be able to handle resource rebalancing and do the right thing with respect to its child devices. And that seems to be broken.

I have found reports of MF.sys crashing:
https://social.msdn.microsoft.com/Forums/en-US/1003a2be-3463-4601-ae91-55cacc29904c/bsod-when-disabling-or-uninstalling-mfsys

and have experienced a verifier violation in MF.sys myself:
https://www.osronline.com/showthread.cfm?link=283584#T5

So blaming MF.sys for this is a plausible theory.

Unfortunately we can't write our own bus driver to replace MF because of its undocumented resource arbitration protocol. If we wanted to write something, it would have to be a full UART driver and that is not trivial.

But, luckily, we support only the 1x flavor of the QEMU PCI serial device so in theory we should be able to make serial.sys drive the UART without MF.sys. That's the next thing to try.

Comment 15 Ladi Prosek 2017-05-10 07:03:41 UTC
An .inf with no MF.sys dependency appears to work and passes the test for me.

Fix committed as:

https://github.com/virtio-win/kvm-guest-drivers-windows/commit/539da1a0f8f1d233051f90fef3ee620527c946e6

Note that we now build two copies of the qemupciserial driver. The upstream one supports all three devices and stays in the same location (root of the internal pre-WHQL build). The RHEL driver supports only the 1x device and will be dropped to a new 'rhel' directory in the internal pre-WHQL build. Please use the one in 'rhel' for testing.

Comment 20 lijin 2017-05-18 05:55:27 UTC
With build 137, all WHQL jobs passed.
Thanks, Ladi

So changing status to verified.

Comment 25 errata-xmlrpc 2017-08-01 12:58:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2341

