Bug 165179 (IT_77281) - Inappropriate DMA causes hang on Dell Server with Radeon 7000.
Summary: Inappropriate DMA causes hang on Dell Server with Radeon 7000.
Keywords:
Status: CLOSED ERRATA
Alias: IT_77281
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: xorg-x11
Version: 4.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kristian Høgsberg
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 156322
Reported: 2005-08-04 22:13 UTC by Rod Macdonald
Modified: 2007-11-30 22:07 UTC

Fixed In Version: RHBA-2005-500
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-05 14:34:16 UTC
Target Upstream Version:
Embargoed:


Attachments
Disable bus mastering before changing FB_LOCATION (1.15 KB, patch)
2005-08-10 18:16 UTC, jon chaplick


Links
System: Red Hat Product Errata
ID: RHBA-2005:500
Private: 0
Priority: qe-ready
Status: SHIPPED_LIVE
Summary: xorg-x11 bug fix and enhancement update
Last Updated: 2005-10-05 04:00:00 UTC

Description Rod Macdonald 2005-08-04 22:13:42 UTC
Description of problem:
SERR error on Dell server when running Linux reboot test.

Version-Release number of selected component (if applicable):
xorg-x11-6.8.1-23.EL
This is the open source radeon driver that ships with RHEL4.

How reproducible:
Run a boot loop test that does the following:
1.  Boot to runlevel 3 and log in automatically.
2.  Start the X Window System.
3.  Reboot.


  
Actual results:

After a variable number of boots we have seen a system hang that is traced to 
the way the ATI part is configured.  A patch is being pushed to the OSC.  The 
release advisory will be appended shortly by Dell.

Specifically the root cause of the failure is as follows:

1.	The video display is active before and during X Window System startup.  
Memory read/write operations occur continuously between the RV100 and the 
frame buffer in local video memory.

2.	Local video memory operations use the base address and address range 
specified in the following registers:
       MC_FB_LOCATION (offset 0148): contains the address range for the frame 
buffer.
       DISPLAY_BASE_ADDR (offset 023C): contains the base addresses for the 
frame buffer.

3.	The first register is decoded as follows: 
Example: MC_FB_LOCATION = E3FF E000
The low 16 bits (E000) are shifted left 16 bits to give MC_FB_START = E000 0000
The high 16 bits (E3FF) have FFFF appended to give MC_FB_TOP = E3FF FFFF

4.	Thus, for this example, all frame buffer read/write operations would 
be kept within the range E000 0000 to E3FF FFFF.

5.	If one of these registers is changed, but not the other, there will 
exist a window of opportunity where a transaction can be created (or may be 
pending from earlier) and will be addressed improperly.  That transaction will 
not fall within the new address range.

6.	If this transaction is presented to the RV100 memory controller while 
these registers are in flux, the memory controller may decide that the request 
is outside the frame buffer range and pass it to the BIF (bus interface), 
which then passes it to the PCI bus.  See diagram next page.

7.	The improper access that is sent to the PCI bus could be any access 
that was intended for the frame buffer.  The memory read or write has been 
redirected by the inconsistent values in the base and range registers. 

8.	Note:  With DMA turned off these transactions will be discarded by the 
memory controller and will not be inappropriately redirected. 

9.	It is clear that changing these registers without either stopping the 
display or blocking memory accesses is bad programming practice and was an 
error introduced into the Open Source Driver by the Open Source Community. 

10.	Our solution stops all DMA activity before the registers are changed 
and allows DMA activity to proceed only after the registers are changed and 
the RV100 has become idle. 
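
For anyone following along, here is a rough sketch of the decoding in item 3 and the routing described in items 5 to 8. It is a toy model in Python, not driver code. The function names, the "old" register value, and the pending address are all made up for illustration and do not come from the trace or the patch.

```python
# Toy model of items 3-8 above. Hypothetical names; not the radeon driver.

def decode_fb_location(value):
    """Split a 32-bit MC_FB_LOCATION value into (MC_FB_START, MC_FB_TOP)."""
    start = (value & 0xFFFF) << 16        # low 16 bits, shifted left 16
    top = (value & 0xFFFF0000) | 0xFFFF   # high 16 bits, with FFFF appended
    return start, top


def route(address, fb_location, bus_master_enabled):
    """Report where the modelled memory controller sends one access."""
    start, top = decode_fb_location(fb_location)
    if start <= address <= top:
        return "frame buffer"
    # Outside the window: forwarded towards the PCI bus, or dropped when
    # bus mastering (DMA) is off, as item 8 describes.
    return "PCI bus" if bus_master_enabled else "discarded"


NEW_LOCATION = 0xE3FFE000  # value written to MC_FB_LOCATION in the report
OLD_LOCATION = 0x03FF0000  # ASSUMPTION: invented pre-change value
PENDING = 0x00100000       # ASSUMPTION: invented access issued under the old window

start, top = decode_fb_location(NEW_LOCATION)
print(f"MC_FB_START = {start:08X}, MC_FB_TOP = {top:08X}")
print("old window, DMA on :", route(PENDING, OLD_LOCATION, True))
print("new window, DMA on :", route(PENDING, NEW_LOCATION, True))
print("new window, DMA off:", route(PENDING, NEW_LOCATION, False))
```

With these made-up values the pending access is inside the old window but outside the new one, so it only reaches the PCI bus when bus mastering is left enabled during the register update. That is the case the fix in item 10 is meant to close off.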


 
PCI Trace Extract

Here is a tabulation of the PCI Bus activity leading up to the failure.  This 
was obtained directly from the first PCI trace sent from Dell:

Reg#   Name                     Operation   Value
----   ----------------------   ---------   --------
014C   MC_AGP_LOCATION          r           003FFFC0
0100   CONFIG_APER_0_BASE       r           E0000000
0108   CONFIG_APER_SIZE         r           04000000
0E40   RBBM_STATUS              r           00000140
0E40   RBBM_STATUS              r           00000140
342C   RB2D_DSTCACHE_CTL_STAT   r           00000000
342C   RB2D_DSTCACHE_CTL_STAT   w           0000000F
342C   RB2D_DSTCACHE_CTL_STAT   r           00000000
0148   MC_FB_LOCATION           w           E3FFE000   <--(Improper accesses begin)
014C   MC_AGP_LOCATION          w           E3FFE400   <--
023C   DISPLAY_BASE_ADDR        w           E0000000   <--
033C   CRTC2_DISPLAY_BASE_ADR   w           E0000000   <--

(First bad bus-mastered DMA read makes it to PCI here) 

043C   OV0_BASE_ADR             w           E0000000   <--(Improper accesses end)

(Bad bus-mastered DMA reads are retried on PCI)
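
As a cross-check on the trace, the value written to MC_FB_LOCATION is consistent with the CONFIG_APER_0_BASE and CONFIG_APER_SIZE values read just before it. The sketch below shows that arithmetic in Python. The assumption that the written value is packed from those two reads is mine, based only on the numbers matching, and the function name is hypothetical.

```python
# Cross-check of the trace values. ASSUMPTION: the MC_FB_LOCATION value is
# packed from the aperture base and size; the report does not state this.

def fb_location_from_aperture(base, size):
    """Pack the window [base, base + size - 1] into the MC_FB_LOCATION layout."""
    top = base + size - 1
    # High 16 bits hold the top of the window, low 16 bits hold the start.
    return (top & 0xFFFF0000) | (base >> 16)


CONFIG_APER_0_BASE = 0xE0000000  # read at register 0100 in the trace
CONFIG_APER_SIZE = 0x04000000    # read at register 0108 in the trace (64 MB)

value = fb_location_from_aperture(CONFIG_APER_0_BASE, CONFIG_APER_SIZE)
print(f"{value:08X}")  # prints E3FFE000, the value written at register 0148
```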

Comment 1 Amit Bhutani 2005-08-05 20:42:44 UTC
Rod- Can you please attach the patch the moment it is available since U2 is 
locking down and we want to see this patch make U2. Please make sure that the 
patch applies cleanly to the xorg tree in RHEL4 before submitting it.

Comment 2 jon chaplick 2005-08-10 18:16:27 UTC
Created attachment 117618
Disable bus mastering before changing FB_LOCATION

Comment 8 Tom Kincaid 2005-08-17 19:46:37 UTC
QE ACK for RHEL 4 only.

Comment 14 Kristian Høgsberg 2005-08-19 19:46:23 UTC
Errata updated, setting this bug to modified.

Comment 17 Red Hat Bugzilla 2005-10-05 14:34:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-500.html


Comment 20 Joachim Frieben 2006-03-16 12:57:17 UTC
"xorg-x11-6.8.1-ati-radeon-RV100-bus-master-fix.patch" breaks "DRI"
for my Radeon R100 QD PCI card. The server hangs when it tries to
launch the login manager. See bug #180150 on this issue.
The problem has been confirmed upstream by M. Dänzer, and a new
patch has been suggested for "Xorg" 6.9. I have ported it to release
6.8.2 and it solves the problem for me.

