Bug 689748 - WHQL BLK 2k8-32bit Common Scenario Stress With IO fail
Summary: WHQL BLK 2k8-32bit Common Scenario Stress With IO fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xenpv-win
Version: 5.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Paolo Bonzini
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 518435
 
Reported: 2011-03-22 11:07 UTC by Huang Wenlong
Modified: 2013-10-20 21:42 UTC
CC: 6 users

Fixed In Version: xenpv-win-1.3.4-7.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-08 08:15:43 UTC
Target Upstream Version:


Attachments
copy of the driver with extra debugging output going to "xm dmesg" (31.71 KB, patch)
2011-03-25 18:01 UTC, Paolo Bonzini
system log (4.07 MB, application/octet-stream)
2011-03-28 02:24 UTC, Huang Wenlong


Links
Red Hat Product Errata RHBA-2011:0853 (normal priority, SHIPPED_LIVE): xenpv-win bug fix update, last updated 2011-06-08 08:15:13 UTC

Description Huang Wenlong 2011-03-22 11:07:21 UTC
Description of problem:

The 2k8-32 blk "Common Scenario Stress With IO" job fails with EnumerateDevicesOverride=3 set and the second disk configured as a mirror of the first disk.

Version-Release number of selected component (if applicable):
xenpv-win-1.3.4-6.el5.rpm
kernel-xen-2.6.18-245.el5.x86_64.rpm
xen-3.0.3-123.el5.x86_64.rpm


How reproducible:
100%

Steps to Reproduce:
1. Install the xenpv-win driver.
2. Set the registry value EnumerateDevicesOverride = 3 (an illustrative sketch of setting this programmatically follows this list).
3. Configure the second disk as a mirror of the first disk.
4. Run the test job.
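
For step 2, the value is normally set by hand with regedit; the snippet below is only an illustrative sketch of setting it programmatically. The service name (rhelscsi, taken from comment 8) and the Parameters subkey path are assumptions, not details confirmed in this report.

#include <windows.h>
#include <stdio.h>

/* Hedged sketch: write EnumerateDevicesOverride = 3 (REG_DWORD) under the
 * driver's Parameters key.  The exact key path below is an assumption. */
int main(void)
{
    const char *path = "SYSTEM\\CurrentControlSet\\Services\\rhelscsi\\Parameters";
    HKEY key;
    DWORD value = 3;
    LONG rc;

    rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE, path, 0, NULL,
                         REG_OPTION_NON_VOLATILE, KEY_SET_VALUE, NULL,
                         &key, NULL);
    if (rc != ERROR_SUCCESS) {
        fprintf(stderr, "RegCreateKeyExA failed: %ld\n", rc);
        return 1;
    }
    rc = RegSetValueExA(key, "EnumerateDevicesOverride", 0, REG_DWORD,
                        (const BYTE *)&value, sizeof(value));
    RegCloseKey(key);
    return rc == ERROR_SUCCESS ? 0 : 1;
}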
  
Actual results:
job fails

Expected results:
job passes

Additional info:
The same job passes on 2k8-64bit.

Comment 1 Paolo Bonzini 2011-03-25 17:59:25 UTC
After running the test multiple times we noticed that the failure is always at or around the 100th cycle, and is due to the PCI device changing to problem code 10 ("the device failed to start").  This may be a bug in xen or, more likely, an OS-specific problem.

The event log (I think the System Log) should have quite a few error messages about the failure, but I couldn't access it from DTM Studio with remote desktop.  If you can find them, you can export them and send it here (Start -> Run -> eventvwr.msc, then right click on System -> Save log file as...).

If it doesn't, you can try installing the attached driver file (just copy it into c:\windows\system32\drivers) and create the following file in dom0

while sleep 1; do
  xm dmesg | sed '/^$/d'
  xm dmesg -c > /dev/null
done

let's call it log-dmesg.  Now do

chmod +x log-dmesg
./log-dmesg > bz689748-dmesg.log

run the test and attach the resulting file here.  Thanks!

Comment 2 Paolo Bonzini 2011-03-25 18:01:17 UTC
Created attachment 487629 [details]
copy of the driver with extra debugging output going to "xm dmesg"

Comment 3 Huang Wenlong 2011-03-28 02:23:38 UTC
(In reply to comment #1)
> After running the test multiple times we noticed that the failure is always at
> or around the 100th cycle, and is due to the PCI device changing to problem
> code 10 ("the device failed to start").  This may be a bug in xen or, more
> likely, an OS-specific problem.
> 
> The event log (I think the System Log) should have quite a few error messages
> about the failure, but I couldn't access it from DTM Studio with remote
> desktop.  If you can find them, you can export them and send it here (Start ->
> Run -> eventvwr.msc, then right click on System -> Save log file as...).
> 
> If it doesn't, you can try installing the attached driver file (just copy it
> into c:\windows\system32\drivers) and create the following file in dom0
> 
> while sleep 1; do
>   xm dmesg | sed '/^$/d'
>   xm dmesg -c > /dev/null
> done
> 
> let's call it log-dmesg.  Now do
> 
> chmod +x log-dmesg
> ./log-dmesg > bz689748-dmesg.log
> 
> run the test and attach the resulting file here.  Thanks!
Hi, Paolo,

I have attached the Event Viewer system log from the 2k8-32 guest.

Wenlong

Comment 4 Huang Wenlong 2011-03-28 02:24:02 UTC
Created attachment 488067 [details]
system log

Comment 5 Paolo Bonzini 2011-03-28 18:26:57 UTC
Thanks for the log.  It looks like the call to StorPortGetDeviceBase is failing in the driver.

I noticed that every time the test restarts the driver, the NNN in "\Device\RaidPortNNN" assigned to the driver increases. This could be the reason why the test fails after exactly 100 iterations, and it means this is not the kind of memory leak that can be debugged with poolmon and similar tools.

Comment 6 Paolo Bonzini 2011-03-29 20:07:00 UTC
Out of curiosity, what version of 2008 are you using? (Standard/Enterprise/Datacenter)

Comment 7 Huang Wenlong 2011-03-30 02:10:17 UTC
(In reply to comment #6)
> Out of curiosity, what version of 2008 are you using?
> (Standard/Enterprise/Datacenter)

Hi, Paolo,

We used win2k8 Datacenter, Service Pack 2, Build 6002.

Wenlong

Comment 8 Paolo Bonzini 2011-03-30 15:20:31 UTC
OK, I can laboriously reproduce it by disabling/enabling RHELSCSI repeatedly.  It fails after 100 or 101 (not sure which) successful cycles.  I'll try again under a debugger.

Comment 9 Paolo Bonzini 2011-03-31 20:48:55 UTC
NTSTATUS RaidTranslateResourceListAddress(PVOID StorPortDeviceExtension,
   PVOID HwDeviceExtension,
   DWORD BusType,
   DWORD SystemIoBusNumber,
   LONGLONG IoAddress,
   DWORD NumberOfBytes,
   BOOLEAN InIoSpace,
   PLONGLONG Address)
{
  int n;                           /* [ebp-18] */
  int i;                           /* [ebp+24] */
  DWORD thisSystemIoBusNumber;     /* [ebp-10] */
  DWORD thisBusType;               /* [ebp-14] */
  PACCESS_RANGE thisRange;         /* [ebp-8] */
  BOOLEAN Found;                   /* [ebp-1] */

  thisRange = 0;
  Found = 0;
  *Address = 0;
  n = RaidGetResourceListCount();
  for (i = 0; i < n; i++) {
    RaidGetResourceListElement(StorPortDeviceExtension,
                               /* [ebp-c] */, &thisRange, &thisBusType,
                               &thisSystemIoBusNumber, i, HwDeviceExtension);
    if (IoAddress >= thisRange->Base &&
        IoAddress + NumberOfBytes < thisRange->Base + thisRange->Length && ...) {
      ...
      break;
    }
  }

  if (thisRange) {
    ...
    *Address = thisRange->Base;
  }
  return thisRange ? 0 : 0xc0000001;   /* 0xc0000001 = STATUS_UNSUCCESSFUL */
}
  

PVOID StorPortGetDeviceBase(
    PVOID  HwDeviceExtension,
    DWORD  BusType,
    ULONG  SystemIoBusNumber,
    LONGLONG  IoAddress,
    ULONG  NumberOfBytes,
    BOOLEAN  InIoSpace
    )
{
  LONGLONG Address;
  StorPortDeviceExtension = ...;
  status = RaidTranslateResourceListAddress(StorPortDeviceExtension,
     HwDeviceExtension,
     BusType,
     SystemIoBusNumber,
     IoAddress,
     NumberOfBytes,
     InIoSpace,
     &Address);
  if (status >= 0)
    {
      if (!InIoSpace) {
        IoAddressVa = MmMapIoSpace(Address, NumberOfBytes, 0);
        RaidAllocateAddressMapping(StorPortDeviceExtension,
                                   IoAddress,
                                   IoAddressVa,
                                   HwDeviceExtension,
                                   SystemIoBusNumber, ... /* [esi+4] */);
        return IoAddressVa;
      }
      ...
    }
  ...
}

MmMapIoSpace fails.  It first tries a large page allocation, which fails, then it goes through MiReserveSystemPtes.  In this function:

   RtlFindClearBits fails
   MiEmptyPteBins(0) returns 1, so it retries allocation
   RtlFindClearBits fails
   MiExpandPtes(??, NumberOfPages) returns 0, so MiReserveSystemPtes fails

The error is caused by fragmentation of the kernel virtual address space.  I can work around it in the drivers, but it would require a respin.
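
The driver-side workaround mentioned above is not shown in this report.  Purely as an illustrative sketch, one way a Storport miniport can avoid piling up MmMapIoSpace mappings across stop/start cycles is to release its device base when the adapter is stopped; the device extension layout below and the idea that the eventual fix works this way are assumptions, not details taken from this bug.

#include <storport.h>

/* Hypothetical per-adapter state; the real xenpv-win extension layout is not
 * shown in this report. */
typedef struct _HW_DEVICE_EXTENSION {
    PVOID IoBase;   /* value previously returned by StorPortGetDeviceBase */
} HW_DEVICE_EXTENSION, *PHW_DEVICE_EXTENSION;

SCSI_ADAPTER_CONTROL_STATUS
HwAdapterControl(PVOID HwDeviceExtension,
                 SCSI_ADAPTER_CONTROL_TYPE ControlType,
                 PVOID Parameters)
{
    PHW_DEVICE_EXTENSION dev = (PHW_DEVICE_EXTENSION)HwDeviceExtension;

    UNREFERENCED_PARAMETER(Parameters);

    if (ControlType == ScsiStopAdapter && dev->IoBase != NULL) {
        /* Give the mapping behind the device base back to the OS so that
         * repeated disable/enable cycles do not exhaust system PTEs. */
        StorPortFreeDeviceBase(HwDeviceExtension, dev->IoBase);
        dev->IoBase = NULL;
    }
    return ScsiAdapterControlSuccess;
}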

Comment 10 Paolo Bonzini 2011-03-31 20:51:21 UTC
Small testcase requiring the devcon utility available at http://support.microsoft.com/kb/311272

:a
devcon disable "PCI\*VEN_5853"
choice /t:5 /d n > nul
devcon enable "PCI\*VEN_5853"
choice /t:5 /d n > nul
goto a

Keep an eye on the event viewer.  Sooner or later it will start failing, but if the \Device\RaidPortNNN number in the error event is higher than ~120, the bug is fixed.

Comment 12 Huang Wenlong 2011-04-11 07:39:22 UTC
Verified this bug in RHEL 5 with:
xenpv-win-1.3.4-7.el5.x86_64.rpm
kernel-xen-2.6.18-245.el5.x86_64.rpm
xen-3.0.3-123.el5.x86_64.rpm

1) open a command line as administrator

2) run diskpart.exe

3) type "san policy=OnlineAll" and then "exit".

Then, in the 2k8-32 guest, I ran all the jobs: Common Scenario Stress With IO and all the other jobs passed.

Comment 13 errata-xmlrpc 2011-06-08 08:15:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0853.html

