Bug 151543 - (Trident Blade 3D) spurious system hangs
Summary: (Trident Blade 3D) spurious system hangs
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: xorg-x11
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: X/OpenGL Maintenance List
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-03-19 02:22 UTC by Christopher P Johnson
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-09-26 20:36:13 UTC
Target Upstream Version:
Embargoed:


Attachments
test shell script (230 bytes, text/plain)
2005-04-26 19:39 UTC, Christopher P Johnson
xorg.conf file (2.92 KB, text/plain)
2005-04-26 19:40 UTC, Christopher P Johnson
uname, lsmod, messages, lspci, xorg.conf file (16.66 KB, application/octet-stream)
2005-05-03 21:11 UTC, Christopher P Johnson

Description Christopher P Johnson 2005-03-19 02:22:28 UTC
Description of problem:

Sun Microsystems Inc. sells Opteron-based servers (v20z/v40z) with
Trident Microsystems Blade 3D PCI/AGP video controllers (see below for
details). Error messages are generated when X is started:

console and /var/log/messages:

mtrr: type mismatch for e5000000,800000 old: write-back new: write-combining

Xorg*.log:

(WW) TRIDENT(0): Failed to set up write-combining range (0xe5000000,0x800000)

Note that this error message is disturbing to customers. There also appear
to be instances where a real error may occur. Finally, these error messages
may keep certain versions of the RHR video certification tests from
succeeding.

Please consider updating trident support in RHEL4 (and RHEL3 if possible)
to resolve this issue.

Version-Release number of selected component (if applicable):

01:05.0 VGA compatible controller: Trident Microsystems Blade 3D PCI/AGP (rev 3a) (prog-if 00 [VGA])
        Subsystem: Newisys, Inc.: Unknown device 0020
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64
        Interrupt: pin A routed to IRQ 177
        Region 0: Memory at e5000000 (32-bit, non-prefetchable) [size=8M]
        Region 1: Memory at e4100000 (32-bit, non-prefetchable) [size=128K]
        Region 2: Memory at e4800000 (32-bit, non-prefetchable) [size=8M]
        Capabilities: [80] AGP version 1.0
                Status: RQ=33 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW- AGP3- Rate=x1,x2
                Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=<none>
        Capabilities: [90] Power Management version 1
                Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 23 10 80 98 07 00 b0 02 3a 00 00 03 00 40 00 00
10: 00 00 00 e5 00 00 10 e4 00 00 80 e4 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 c2 17 20 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 0a 01 00 00


How reproducible:

Start the X server after automatic configuration by RHEL.

Comment 1 Mike A. Harris 2005-04-11 10:34:06 UTC
>(WW) TRIDENT(0): Failed to set up write-combining range (0xe5000000,0x800000)
[SNIP]
>Please consider updating trident support in RHEL4 (and RHEL3 if possible)
>to resolve this issue.

This message is a "warning" and not an "error".   There are many reasons
why a user might see this warning in the server log file, many of which
are system hardware limitations.  It does not however have anything to
do with the video driver.
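The kernel message quoted in the description reports that the requested write-combining range conflicts with an existing write-back entry. A minimal sketch for finding which existing entry covers the range follows. It assumes the kernel exposes its MTRR table at /proc/mtrr with lines containing "base=0x..." and "size=...MB" (or KB). The function name mtrr_overlap is made up for illustration.

```shell
#!/bin/sh
# Print every MTRR entry whose range overlaps [BASE, BASE+SIZE).
# Usage: mtrr_overlap BASE SIZE [FILE]
# FILE defaults to /proc/mtrr; the line layout parsed here is an assumption.
mtrr_overlap() {
    want_base=$(( $1 ))
    want_end=$(( $1 + $2 ))
    while IFS= read -r line; do
        # Pull out the hex base address and the size with its unit (KB or MB).
        base=$(printf '%s\n' "$line" | sed -n 's/.*base=\(0x[0-9a-fA-F]*\).*/\1/p')
        size=$(printf '%s\n' "$line" | sed -n 's/.*size= *\([0-9]*\)[KM]B.*/\1/p')
        unit=$(printf '%s\n' "$line" | sed -n 's/.*size= *[0-9]*\([KM]\)B.*/\1/p')
        # Skip lines that do not look like MTRR entries.
        [ -n "$base" ] && [ -n "$size" ] || continue
        case $unit in
            K) bytes=$(( $size * 1024 )) ;;
            *) bytes=$(( $size * 1024 * 1024 )) ;;
        esac
        start=$(( $base ))
        end=$(( $start + $bytes ))
        # Two half-open ranges overlap when each starts before the other ends.
        if [ "$start" -lt "$want_end" ] && [ "$want_base" -lt "$end" ]; then
            printf '%s\n' "$line"
        fi
    done < "${3:-/proc/mtrr}"
}
```

On an affected machine, `mtrr_overlap 0xe5000000 0x800000` would list the entries that overlap the frame buffer region named in the warning.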

Please contact your Red Hat partner manager or technical support
representative at 1-888-REDHAT1 for further assistance with this
or any other issue.  Global Support Services is the doorway for technical
support for issues of this nature for Red Hat Enterprise Linux.

Hope this helps, thanks.


Comment 2 Christopher P Johnson 2005-04-14 17:59:40 UTC
There are two issues here:

1) these error messages have been seen on systems that lock up - the X server
has been suggested as the cause.

2) these error messages cause Red Hat Ready certification to fail.

The issue appears to be with the underlying vga driver for the vga chipset, and
a newer version of the driver supposedly resolves both these issues.

Comment 5 Mike A. Harris 2005-04-15 06:19:22 UTC
To troubleshoot this issue, we need you to work through the steps below
and provide the data listed afterwards to assist in diagnosis:

Please perform the following steps:

1) Update your system to the latest kernel and xorg-x11 packages that
   have been released as official updates for RHEL-4.

2) Ensure that you are not using any 3rd party kernel modules, or disable
   them from starting at or after bootup.

3) Reboot your system.  Make sure it boots into the latest RHEL-4 kernel
   update we've released.

4) Run "system-config-display --reconfig" to generate a brand new X
   configuration from scratch.

5) Start the X server

6) Indicate in specific detail, the exact failure you experience, and the
   specific steps to reproduce the problem.  If there is more than one
   type of failure, please file separate support requests with Red Hat
   GSS for each issue, so they can be investigated and resolved individually.

We'll need you to attach the following files as individual uncompressed
bugzilla file attachments using the link below:

- X server log files ( /var/log/Xorg.0.log* ) - be sure to include the .old file
- X server config file
- The complete /var/log/messages from the last system boot onward
- the output of "uname -a"
- the output of "lsmod"
- the output of "lspci -vvn"
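The items in this list could be gathered with a short script along the following lines. This is only a sketch: the function name collect_x_diag and the output directory are hypothetical, and the file paths are assumed default locations on RHEL 4 (X server log under /var/log, config at /etc/X11/xorg.conf). It copies the whole of /var/log/messages rather than trimming it to the last boot.

```shell
#!/bin/sh
# Gather the requested data into one directory, one plain-text file per item.
# Usage: collect_x_diag [OUTDIR]   (OUTDIR is a hypothetical default)
collect_x_diag() {
    out=${1:-./x-diag}
    mkdir -p "$out" || return 1

    uname -a > "$out/uname.txt"

    # Run the optional tools only when they are installed; ignore failures.
    if command -v lsmod >/dev/null 2>&1; then
        lsmod > "$out/lsmod.txt" 2>&1 || true
    fi
    if command -v lspci >/dev/null 2>&1; then
        lspci -vvn > "$out/lspci.txt" 2>&1 || true
    fi

    # Copy each log or config file that exists and is readable (assumed paths).
    for f in /var/log/Xorg.0.log /var/log/Xorg.0.log.old \
             /etc/X11/xorg.conf /var/log/messages; do
        if [ -r "$f" ]; then
            cp "$f" "$out/$(basename "$f").txt" || true
        fi
    done
    return 0
}
```

Run as root, for example `collect_x_diag /tmp/x-diag`, then attach each resulting file individually with the text/plain type.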


Assuming the problem is still reproducible after supplying the above
information, please try adding the following to the device section of
your X server config file:

    Option "NoMTRR"

After this, restart the X server and attempt to reproduce the problem
again.  Please attach the new X server log file (and .old one) and indicate
if the problem still persists or not.

Then try adding the following to the device section of the config:

    Option "noaccel"

Restart again, and attach the log files from this invocation also, and
indicate if the problem persists or not.
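For reference, a device section with both options added might look like the fragment below. This is a sketch only: the Identifier value is an assumed placeholder, the Driver line assumes the default "trident" driver, and the two Option lines are the ones named in this comment. Keep whatever other lines system-config-display generated.

```
Section "Device"
    Identifier "Videocard0"    # assumed name; keep the one already in your file
    Driver     "trident"
    Option     "NoMTRR"        # first test: add only this line
    Option     "noaccel"       # second test: add this line as well
EndSection
```

Adding the options one at a time, with a restart and a fresh log between them, keeps the two results separable.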

Once you've tried these troubleshooting tips and supplied the requested
information, we'll review it and attempt to diagnose the underlying
cause of the problems.

Please be very detailed in your explanation of what occurs, and how to
reproduce it.  Include the exact output of any error messages you see,
or digital pictures of the screen if appropriate.

Thanks in advance.


Comment 6 Mike A. Harris 2005-04-15 06:20:34 UTC
Setting status to "NEEDINFO", awaiting results of troubleshooting
and file attachments.



Comment 7 Mike A. Harris 2005-04-20 11:49:11 UTC
ping

Comment 8 Christopher P Johnson 2005-04-22 19:08:50 UTC
Thanks for the update - I am rerunning the test from bug 113533 (start and stop
the X server forever) on RHEL 4 Update 1 beta, to see if more debugging
information can be captured on the lock-up.


Comment 9 Mike A. Harris 2005-04-26 19:05:51 UTC
Setting status to "NEEDINFO", awaiting results of troubleshooting
and file attachments.

Comment 10 Christopher P Johnson 2005-04-26 19:39:22 UTC
Some test results:
1) the v40z test machine locked up after a day of running

while true; do
init 5
sleep 15
init 3
sleep 15
done

Couldn't get into HDT mode to get a register dump; sysrq was frozen, etc.

2) Created an artificial test script (attached) to try to trigger the lock-up (not
sure whether it triggers the same error). The system locks up within a few minutes
with the standard xorg.conf, and also with Option "NoMTRR" added. It did run for a
few hours with Option "noaccel" also added.

The request is that the X server as configured by RHEL 4/3 not potentially lock up
the system.

Comment 11 Christopher P Johnson 2005-04-26 19:39:58 UTC
Created attachment 113687
test shell script

Comment 12 Christopher P Johnson 2005-04-26 19:40:28 UTC
Created attachment 113688
xorg.conf file

Comment 13 Mike A. Harris 2005-05-03 12:12:51 UTC
When attaching "text" files to bugzilla, please select the
"text/plain" mime-type, so that the file attachment is
viewable in any standard web browser.

TIA

Comment 14 Mike A. Harris 2005-05-03 12:24:06 UTC
Several pieces of information requested above in comment #5 are still
missing from the report.  Please re-review comment #5 and attach
the remainder of the information requested above.

We do not have Trident video hardware available to attempt to
reproduce locally and diagnose, so the requested information is
critically important before we can proceed any further with
diagnosis.


Setting status back to "NEEDINFO" and awaiting attachment of
remainder of information requested above.

Thanks in advance.



Comment 16 Mike A. Harris 2005-05-03 12:29:10 UTC
Updating "Summary" to reflect the real symptoms.

Comment 18 Christopher P Johnson 2005-05-03 21:11:13 UTC
Created attachment 113986
uname, lsmod, messages, lspci, xorg.conf file

Unfortunately the original test machine is gone. Here's information from
a v20z (same trident chipset, exhibits same problem).

Comment 19 Mike A. Harris 2005-05-26 08:12:47 UTC
Please ensure all attachments are always attached as individual
uncompressed file attachments that are web browser viewable.

Thanks in advance.

Comment 22 Mike A. Harris 2005-07-12 12:15:00 UTC
(In reply to comment #2)

> The issue appears to be with the underlying vga driver for the vga chipset, and
> a newer version of the driver supposedly resolves both these issues.

Could you clarify this part for me?  The system default driver for Trident
hardware is the "trident" driver.  There is a "vga" driver also, but it is
for ancient 16 color 640x480 and lower standard VGA hardware from the
early to mid '90s, and we never use it by default for any hardware.

The reason I seek clarification is that there is a different bug
reported against our Fedora Core 4 OS release, which is caused by
the X server module "libvgahw.a" being miscompiled.  That bug is
bug #161242, which affects a fairly large number of users with a
variety of hardware, including Trident, under Fedora Core 4 Xorg, which
is compiled with gcc4.  While I don't believe that bug is related to
the problem you're experiencing here, as RHEL4 is compiled with gcc3
and the problem in FC4 is gcc4 specific, I thought I would get you
to confirm that by "vga driver for the vga chipset" you actually meant
"trident driver for Trident chipsets".




