Bug 689748 - WHQL BLK 2k8-32bit Common Scenario Stress With IO fail
Summary: WHQL BLK 2k8-32bit Common Scenario Stress With IO fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xenpv-win
Version: 5.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Paolo Bonzini
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 518435
 
Reported: 2011-03-22 11:07 UTC by Huang Wenlong
Modified: 2013-10-20 21:42 UTC
CC: 6 users

Fixed In Version: xenpv-win-1.3.4-7.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-08 08:15:43 UTC
Target Upstream Version:


Attachments
copy of the driver with extra debugging output going to "xm dmesg" (31.71 KB, patch)
2011-03-25 18:01 UTC, Paolo Bonzini
system log (4.07 MB, application/octet-stream)
2011-03-28 02:24 UTC, Huang Wenlong


Links
Red Hat Product Errata RHBA-2011:0853 (normal priority, SHIPPED_LIVE): xenpv-win bug fix update, last updated 2011-06-08 08:15:13 UTC

Description Huang Wenlong 2011-03-22 11:07:21 UTC
Description of problem:

The 2k8-32 blk "Common Scenario Stress With IO" job fails with EnumerateDevicesOverride=3 set and the second disk configured as a mirror of the first disk.

Version-Release number of selected component (if applicable):
xenpv-win-1.3.4-6.el5.rpm
kernel-xen-2.6.18-245.el5.x86_64.rpm
xen-3.0.3-123.el5.x86_64.rpm


How reproducible:
100%

Steps to Reproduce:
1. Install the xenpv-win driver.
2. Set the registry value EnumerateDevicesOverride = 3 (an illustrative sketch of setting this programmatically follows this list).
3. Configure the second disk as a mirror of the first disk.
4. Run the test job.
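
For step 2, the value is normally set by hand with regedit; the snippet below is only an illustrative sketch of setting it programmatically. The service name (rhelscsi, taken from comment 8) and the Parameters subkey path are assumptions, not details confirmed in this report.

#include <windows.h>
#include <stdio.h>

/* Hedged sketch: write EnumerateDevicesOverride = 3 (REG_DWORD) under the
 * driver's Parameters key.  The exact key path below is an assumption. */
int main(void)
{
    const char *path = "SYSTEM\\CurrentControlSet\\Services\\rhelscsi\\Parameters";
    HKEY key;
    DWORD value = 3;
    LONG rc;

    rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE, path, 0, NULL,
                         REG_OPTION_NON_VOLATILE, KEY_SET_VALUE, NULL,
                         &key, NULL);
    if (rc != ERROR_SUCCESS) {
        fprintf(stderr, "RegCreateKeyExA failed: %ld\n", rc);
        return 1;
    }
    rc = RegSetValueExA(key, "EnumerateDevicesOverride", 0, REG_DWORD,
                        (const BYTE *)&value, sizeof(value));
    RegCloseKey(key);
    return rc == ERROR_SUCCESS ? 0 : 1;
}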
  
Actual results:
job fails

Expected results:
job passes

Additional info:
The same job passes on 2k8-64bit.

Comment 1 Paolo Bonzini 2011-03-25 17:59:25 UTC
After running the test multiple times we noticed that the failure is always at or around the 100th cycle, and is due to the PCI device changing to problem code 10 ("the device failed to start").  This may be a bug in xen or, more likely, an OS-specific problem.

The event log (I think the System Log) should have quite a few error messages about the failure, but I couldn't access it from DTM Studio with remote desktop.  If you can find them, you can export them and send it here (Start -> Run -> eventvwr.msc, then right click on System -> Save log file as...).

If it doesn't, you can try installing the attached driver file (just copy it into c:\windows\system32\drivers) and create the following file in dom0

while sleep 1; do
  xm dmesg | sed '/^$/d'
  xm dmesg -c > /dev/null
done

let's call it log-dmesg.  Now do

chmod +x log-dmesg
./log-dmesg > bz689748-dmesg.log

run the test and attach the resulting file here.  Thanks!

Comment 2 Paolo Bonzini 2011-03-25 18:01:17 UTC
Created attachment 487629 [details]
copy of the driver with extra debugging output going to "xm dmesg"

Comment 3 Huang Wenlong 2011-03-28 02:23:38 UTC
(In reply to comment #1)
> After running the test multiple times we noticed that the failure is always at
> or around the 100th cycle, and is due to the PCI device changing to problem
> code 10 ("the device failed to start").  This may be a bug in xen or, more
> likely, an OS-specific problem.
> 
> The event log (I think the System Log) should have quite a few error messages
> about the failure, but I couldn't access it from DTM Studio with remote
> desktop.  If you can find them, you can export them and send it here (Start ->
> Run -> eventvwr.msc, then right click on System -> Save log file as...).
> 
> If it doesn't, you can try installing the attached driver file (just copy it
> into c:\windows\system32\drivers) and create the following file in dom0
> 
> while sleep 1; do
>   xm dmesg | sed '/^$/d'
>   xm dmesg -c > /dev/null
> done
> 
> let's call it log-dmesg.  Now do
> 
> chmod +x log-dmesg
> ./log-dmesg > bz689748-dmesg.log
> 
> run the test and attach the resulting file here.  Thanks!
Hi, Paolo,

I have attached the Event Viewer system log from the 2k8-32 guest.

Wenlong

Comment 4 Huang Wenlong 2011-03-28 02:24:02 UTC
Created attachment 488067 [details]
system log

Comment 5 Paolo Bonzini 2011-03-28 18:26:57 UTC
Thanks for the log.  It looks like the call to StorPortGetDeviceBase is failing in the driver.

I noticed that every time the test restarts the driver, the NNN in "\Device\RaidPortNNN" assigned to the driver increases. This could be the reason why the test fails after exactly 100 iterations, and it means this is not the kind of memory leak that can be debugged with poolmon and similar tools.

Comment 6 Paolo Bonzini 2011-03-29 20:07:00 UTC
Out of curiosity, what version of 2008 are you using? (Standard/Enterprise/Datacenter)

Comment 7 Huang Wenlong 2011-03-30 02:10:17 UTC
(In reply to comment #6)
> Out of curiosity, what version of 2008 are you using?
> (Standard/Enterprise/Datacenter)

Hi, Paolo,

We used win2k8 Datacenter, Service Pack 2, Build 6002.

Wenlong

Comment 8 Paolo Bonzini 2011-03-30 15:20:31 UTC
OK, I can laboriously reproduce it by disabling/enabling RHELSCSI repeatedly.  It fails after 100 or 101 (not sure which) successful cycles.  I'll try again under a debugger.

Comment 9 Paolo Bonzini 2011-03-31 20:48:55 UTC
NTSTATUS RaidTranslateResourceListAddress(PVOID StorPortDeviceExtension,
   PVOID HwDeviceExtension,
   DWORD BusType,
   DWORD SystemIoBusNumber,
   LONGLONG IoAddress,
   DWORD NumberOfBytes,
   BOOLEAN InIoSpace,
   PLONGLONG Address)
{
  int n;                           /* [ebp-18] */
  int i;                           /* [ebp+24] */
  DWORD thisSystemIoBusNumber;     /* [ebp-10] */
  DWORD thisBusType;               /* [ebp-14] */
  PACCESS_RANGE thisRange;         /* [ebp-8] */
  BOOLEAN Found;                   /* [ebp-1] */

  thisRange = 0;
  Found = 0;
  *Address = 0;
  n = RaidGetResourceListCount();
  for (i = 0; i < n; i++) {
    RaidGetResourceListElement(StorPortDeviceExtension,
                               /* [ebp-c] */, &thisRange, &thisBusType,
                               &thisSystemIoBusNumber, i, HwDeviceExtension);
    if (IoAddress >= thisRange->Base &&
        IoAddress + NumberOfBytes < thisRange->Base + thisRange->Length && ...) {
      ...
      break;
    }
  }

  if (thisRange) {
    ...
    *Address = thisRange->Base;
  }
  return thisRange ? 0 : 0xc0000001;   /* 0xc0000001 = STATUS_UNSUCCESSFUL */
}
  

PVOID StorPortGetDeviceBase(
    PVOID  HwDeviceExtension,
    DWORD  BusType,
    ULONG  SystemIoBusNumber,
    LONGLONG  IoAddress,
    ULONG  NumberOfBytes,
    BOOLEAN  InIoSpace
    )
{
  LONGLONG Address;
  StorPortDeviceExtension = ...;
  status = RaidTranslateResourceListAddress(StorPortDeviceExtension,
     HwDeviceExtension,
     BusType,
     SystemIoBusNumber,
     IoAddress,
     NumberOfBytes,
     InIoSpace,
     &Address);
  if (status >= 0)
    {
      if (!InIoSpace) {
        IoAddressVa = MmMapIoSpace(Address, NumberOfBytes, 0);
        RaidAllocateAddressMapping(StorPortDeviceExtension,
                                   IoAddress,
                                   IoAddressVa,
                                   HwDeviceExtension,
                                   SystemIoBusNumber, ... /* [esi+4] */);
        return IoAddressVa;
      }
      ...
    }
  ...
}

MmMapIoSpace fails.  It first tries a large page allocation, which fails, then it goes through MiReserveSystemPtes.  In this function:

   RtlFindClearBits fails
   MiEmptyPteBins(0) returns 1, so it retries allocation
   RtlFindClearBits fails
   MiExpandPtes(??, NumberOfPages) returns 0, so MiReserveSystemPtes fails

The error is caused by fragmentation of the kernel virtual address space.  I can work around it in the drivers, but it would require a respin.
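
The driver-side workaround mentioned above is not shown in this report.  Purely as an illustrative sketch, one way a Storport miniport can avoid piling up MmMapIoSpace mappings across stop/start cycles is to release its device base when the adapter is stopped; the device extension layout below and the idea that the eventual fix works this way are assumptions, not details taken from this bug.

#include <storport.h>

/* Hypothetical per-adapter state; the real xenpv-win extension layout is not
 * shown in this report. */
typedef struct _HW_DEVICE_EXTENSION {
    PVOID IoBase;   /* value previously returned by StorPortGetDeviceBase */
} HW_DEVICE_EXTENSION, *PHW_DEVICE_EXTENSION;

SCSI_ADAPTER_CONTROL_STATUS
HwAdapterControl(PVOID HwDeviceExtension,
                 SCSI_ADAPTER_CONTROL_TYPE ControlType,
                 PVOID Parameters)
{
    PHW_DEVICE_EXTENSION dev = (PHW_DEVICE_EXTENSION)HwDeviceExtension;

    UNREFERENCED_PARAMETER(Parameters);

    if (ControlType == ScsiStopAdapter && dev->IoBase != NULL) {
        /* Give the mapping behind the device base back to the OS so that
         * repeated disable/enable cycles do not exhaust system PTEs. */
        StorPortFreeDeviceBase(HwDeviceExtension, dev->IoBase);
        dev->IoBase = NULL;
    }
    return ScsiAdapterControlSuccess;
}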

Comment 10 Paolo Bonzini 2011-03-31 20:51:21 UTC
Small testcase requiring the devcon utility available at http://support.microsoft.com/kb/311272

:a
devcon disable "PCI\*VEN_5853"
choice /t:5 /d n > nul
devcon enable "PCI\*VEN_5853"
choice /t:5 /d n > nul
goto a

Keep an eye on the event viewer.  Sooner or later it will start failing, but if the \Device\RaidPortNNN number in the error event is higher than ~120, the bug is fixed.

Comment 12 Huang Wenlong 2011-04-11 07:39:22 UTC
Verified this bug in RHEL 5 with:
xenpv-win-1.3.4-7.el5.x86_64.rpm
kernel-xen-2.6.18-245.el5.x86_64.rpm
xen-3.0.3-123.el5.x86_64.rpm

1) open a command line as administrator

2) run diskpart.exe

3) type "san policy=OnlineAll" and then "exit".

Then, in the 2k8-32 guest, I ran all the jobs: Common Scenario Stress With IO and all the other jobs passed.

Comment 13 errata-xmlrpc 2011-06-08 08:15:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0853.html

