| Summary: | WHQL BLK 2k8-32bit Common Scenario Stress With IO fail | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Huang Wenlong <whuang> | ||||||
| Component: | xenpv-win | Assignee: | Paolo Bonzini <pbonzini> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 5.7 | CC: | cshao, cwei, drjones, leiwang, mshao, rwu | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | xenpv-win-1.3.4-7.el5 | Doc Type: | Bug Fix | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2011-06-08 08:15:43 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 518435 | ||||||||
| Attachments: |
|
||||||||
|
Description
Huang Wenlong
2011-03-22 11:07:21 UTC
After running the test multiple times we noticed that the failure is always at or around the 100th cycle, and is due to the PCI device changing to problem code 10 ("the device failed to start"). This may be a bug in xen or, more likely, an OS-specific problem.
The event log (I think the System Log) should have quite a few error messages about the failure, but I couldn't access it from DTM Studio with remote desktop. If you can find them, you can export them and send it here (Start -> Run -> eventvwr.msc, then right click on System -> Save log file as...).
If it doesn't, you can try installing the attached driver file (just copy it into c:\windows\system32\drivers) and create the following file in dom0
while sleep 1; do
xm dmesg | sed '/^$/d'
xm dmesg -c > /dev/null
done
let's call it log-dmesg. Now do
chmod +x log-dmesg
./log-dmesg > bz689748-dmesg.log
run the test and attach the resulting file here. Thanks!
Created attachment 487629 [details]
copy of the driver with extra debugging output going to "xm dmesg"
(In reply to comment #1)
> After running the test multiple times we noticed that the failure is always at
> or around the 100th cycle, and is due to the PCI device changing to problem
> code 10 ("the device failed to start"). This may be a bug in xen or, more
> likely, an OS-specific problem.
>
> The event log (I think the System Log) should have quite a few error messages
> about the failure, but I couldn't access it from DTM Studio with remote
> desktop. If you can find them, you can export them and send it here (Start ->
> Run -> eventvwr.msc, then right click on System -> Save log file as...).
>
> If it doesn't, you can try installing the attached driver file (just copy it
> into c:\windows\system32\drivers) and create the following file in dom0
>
> while sleep 1; do
> xm dmesg | sed '/^$/d'
> xm dmesg -c > /dev/null
> done
>
> let's call it log-dmesg. Now do
>
> chmod +x log-dmesg
> ./log-dmesg > bz689748-dmesg.log
>
> run the test and attach the resulting file here. Thanks!
Hi Paolo,
I have attached the Event Viewer system log from the 2k8-32 guest.
Wenlong
Created attachment 488067 [details]
system log
Thanks for the log. It looks like the driver is failing the call to StorPortGetDeviceBase.
I noticed that every time the test restarts the driver, the NNN in "\Device\RaidPortNNN" assigned to the driver increases. This could be the reason why the test fails after exactly 100 iterations. So, it's not the kind of memory leak that can be debugged with poolmon and similar tools.
Out of curiosity, what version of 2008 are you using? (Standard/Enterprise/Datacenter)
(In reply to comment #6)
> Out of curiosity, what version of 2008 are you using?
> (Standard/Enterprise/Datacenter)
Hi Paolo,
We used Win2k8 Datacenter, Service Pack 2, Build 6002.
Wenlong
OK, I can laboriously reproduce it by disabling/enabling RHELSCSI repeatedly. It fails after 100 or 101 (not sure) successful cycles. I'll try again under a debugger.
NTSTATUS RaidTranslateResourceListAddress(PVOID StorPortDeviceExtension,
PVOID HwDeviceExtension,
DWORD BusType,
DWORD SystemIoBusNumber,
LONGLONG IoAddress,
DWORD NumberOfBytes,
BOOLEAN InIoSpace,
PLONGLONG Address)
{
int n [ebp-18], i [ebp+24];
DWORD thisSystemIoBusNumber [ebp-10];
DWORD thisBusType [ebp-14];
PACCESS_RANGE thisRange [ebp-8];
BOOLEAN Found [ebp-1];
thisRange = 0;
Found = 0;
*Address = 0;
n=RaidGetResourceListCount()
for (i = 0; i < n; i++) {
RaidGetResourceListElement(StorPortDeviceExtension,
ebp-c, &thisRange, &thisBusType, &thisSystemIoBusNumber,
i, HwDeviceExtension)
if (IoAddress >= thisRange->Base &&
IoAddress + NumberOfBytes < *thisRange->Base + thisRange->Length && ...) {
...
break;
}
}
if (thisRange) {
...
*Address = thisRange->Base;
}
return thisRange ? 0 : 0xc0000001;
}
PVOID StorPortGetDeviceBase(
PVOID HwDeviceExtension,
DWORD BusType,
ULONG SystemIoBusNumber,
LONGLONG IoAddress,
ULONG NumberOfBytes,
BOOLEAN InIoSpace
)
{
LONGLONG Address;
StorPortDeviceExtension = ...
status = RaidTranslateResourceListAddress(StorPortDeviceExtension,
HwDeviceExtension,
BusType,
SystemIoBusNumber,
IoAddress,
NumberOfBytes,
InIoSpace,
&Address)
if (status >= 0)
{
if (!InIoSpace) {
IoAddressVa = MmMapIoSpace(Address,NumberOfBytes,0);
RaidAllocateAddressMapping(StorPortDeviceExtension,
IoAddress,
IoAddressVa,
HwDeviceExtension,
SystemIoBusNumber, [esi+4])
return IoAddressVa;
}
MmMapIoSpace fails. It first tries a large page allocation, which fails, then it goes through MiReserveSystemPtes. In this function:
RtlFindClearBits fails
MiEmptyPteBins(0) returns 1, so it retries allocation
RtlFindClearBits fails
MiExpandPtes(??, NumberOfPages) returns 0, so MiReserveSystemPtes fails
The error is caused by fragmentation of the kernel virtual address space. I can work around it in the drivers, but it would require a respin.
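To illustrate why a bitmap search such as RtlFindClearBits can fail even though plenty of entries are free, here is a minimal sketch in C. It is not the Windows implementation: `find_clear_run`, `count_clear` and `fragment_alternating` are hypothetical helpers, and the real system PTE allocator is considerably more involved. The sketch only shows the general effect of fragmentation on a search for a contiguous free run.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bitmap search: returns the index of the
 * first run of `run` consecutive free (false) entries, or -1 if no
 * such contiguous run exists. */
static long find_clear_run(const bool *used, size_t nbits, size_t run)
{
    size_t start = 0, len = 0;

    if (run == 0 || run > nbits)
        return -1;
    for (size_t i = 0; i < nbits; i++) {
        if (used[i]) {          /* run interrupted, start over */
            len = 0;
            continue;
        }
        if (len == 0)
            start = i;          /* first free entry of a new run */
        if (++len == run)
            return (long)start;
    }
    return -1;
}

/* Counts free entries regardless of where they are. */
static size_t count_clear(const bool *used, size_t nbits)
{
    size_t n = 0;

    for (size_t i = 0; i < nbits; i++)
        if (!used[i])
            n++;
    return n;
}

/* Worst-case fragmentation: every other entry is in use, so half of
 * the map is free but the longest free run is a single entry. */
static void fragment_alternating(bool *used, size_t nbits)
{
    for (size_t i = 0; i < nbits; i++)
        used[i] = (i % 2 == 0);
}
```

With a 16-entry map filled by `fragment_alternating`, `count_clear` reports 8 free entries, yet `find_clear_run` for a run of 2 returns -1. A request for a multi-page mapping would fail in the same way, which matches the pattern above where the search fails although the address space is not exhausted.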
Small testcase requiring the devcon utility available at http://support.microsoft.com/kb/311272
:a
devcon disable "PCI\*VEN_5853"
choice /t:5 /d n > nul
devcon enable "PCI\*VEN_5853"
choice /t:5 /d n > nul
goto a
Keep an eye on the event viewer. Sooner or later it will start failing, but if the \Device\RaidPortNNN number in the error event is higher than ~120, the bug is fixed.
Verified this bug on RHEL 5 with
xenpv-win-1.3.4-7.el5.x86_64.rpm
kernel-xen-2.6.18-245.el5.x86_64.rpm
xen-3.0.3-123.el5.x86_64.rpm
1) open a command line as administrator
2) run diskpart.exe
3) type "san policy=OnlineAll" and then "exit".
in the 2k8-32 guest, then ran all the jobs. Common Scenario Stress With IO and the other jobs all passed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2011-0853.html |