Description of problem:
SERR error on Dell server when running Linux reboot test.

Version-Release number of selected component (if applicable):
xorg-x11-6.8.1-23.EL
This is the open source radeon driver that ships with RHEL4.

How reproducible:
Run a boot loop test that does the following:
1. Boot to run level 3 and autologin
2. Start X Window
3. Reboot

Actual results:
After a variable number of boots we have seen a system hang that is traced to the way the ATI part is configured. A patch is being pushed to the OSC. The release advisory will be appended shortly by Dell.

Specifically, the root cause of the failure is as follows:

1. Video display is active before and during the starting of X Window. Memory read/write operations occur continuously between the RV100 and the frame buffer in local video memory.

2. Local video memory operations use the base address and address range specified in the following registers:
   MC_FB_LOCATION (offset 0148) - contains the address range for the frame buffer.
   DISPLAY_BASE_ADDR (offset 023C) - contains the base address for the frame buffer.

3. The first register is decoded as follows.
   Example: MC_FB_LOCATION = E3FF E000
   The low 16 bits are shifted left 16 bits to provide MC_FB_START = E000 0000
   The high 16 bits have FFFF appended to provide MC_FB_TOP = E3FF FFFF

4. Thus, for this example, all frame buffer read/write operations would be kept within the range E000 0000 to E3FF FFFF.

5. If one of these registers is changed but not the other, there is a window in which a transaction can be created (or may be pending from earlier) and will be addressed improperly. That transaction will not fall within the new address range.

6. If this transaction is presented to the RV100 memory controller while these registers are in flux, the memory controller may decide that the request is outside the frame buffer range and pass it to the BIF (bus interface), which will then pass it to the PCI bus. See diagram next page.

7. The improper access that is sent to PCI could be any access that was intended for the frame buffer. The memory read or write has been redirected by the inconsistent values in the base and range registers.

8. Note: With DMA turned off, these transactions will be discarded by the memory controller and will not be inappropriately redirected.

9. It is clear that changing these registers without either stopping the display or blocking memory accesses is bad programming practice, and was an error introduced into the Open Source Driver by the Open Source Community.

10. Our solution stops all DMA activity before the registers are changed and allows DMA activity to proceed only after the registers are changed and the RV100 has become idle.

PCI Trace Extract

Here is a tabulation of the PCI bus activity leading up to the failure. This was obtained directly from the first PCI trace sent from Dell:

Reg#  Name                    Operation  Value
----  ----------------------  ---------  --------
014C  MC_AGP_LOCATION         r          003FFFC0
0100  CONFIG_APER_0_BASE      r          E0000000
0108  CONFIG_APER_SIZE        r          04000000
0E40  RBBM_STATUS             r          00000140
0E40  RBBM_STATUS             r          00000140
342C  RB2D_DSTCACHE_CTL_STAT  r          00000000
342C  RB2D_DSTCACHE_CTL_STAT  w          0000000F
342C  RB2D_DSTCACHE_CTL_STAT  r          00000000
0148  MC_FB_LOCATION          w          E3FFE000  <-- (Improper accesses begin)
014C  MC_AGP_LOCATION         w          E3FFE400  <--
023C  DISPLAY_BASE_ADDR       w          E0000000  <--
033C  CRTC2_DISPLAY_BASE_ADR  w          E0000000  <-- (First bad bus-mastered DMA read makes it to PCI here)
043C  OV0_BASE_ADR            w          E0000000  <-- (Improper accesses end)
(Bad bus-mastered DMA reads are retried on PCI)
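For readers who want to check the arithmetic in step 3 of the root-cause list, here is a small Python sketch of the MC_FB_LOCATION decoding. It only restates the shift-and-append rule given in the report; the function names are made up for illustration and are not part of the radeon driver.

```python
def decode_mc_fb_location(value):
    """Split a 32-bit MC_FB_LOCATION value into (MC_FB_START, MC_FB_TOP).

    Follows the rule stated in the report: the low 16 bits, shifted left
    by 16, give the start; the high 16 bits with FFFF appended give the top.
    """
    low_half = value & 0xFFFF
    high_half = (value >> 16) & 0xFFFF
    fb_start = low_half << 16
    fb_top = (high_half << 16) | 0xFFFF
    return fb_start, fb_top


def in_frame_buffer(address, mc_fb_location):
    """Hypothetical helper: is `address` inside the decoded range?"""
    fb_start, fb_top = decode_mc_fb_location(mc_fb_location)
    return fb_start <= address <= fb_top


if __name__ == "__main__":
    start, top = decode_mc_fb_location(0xE3FFE000)
    print(f"MC_FB_START = {start:08X}, MC_FB_TOP = {top:08X}")
```

With the example value E3FFE000 this gives the range E0000000 to E3FFFFFF, matching step 4. An address outside that range is what the report describes as being handed on to the BIF and PCI.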
Rod, can you please attach the patch the moment it is available? U2 is locking down and we want this patch to make U2. Please make sure that the patch applies cleanly to the xorg tree in RHEL4 before submitting it.
Created attachment 117618 [details] Disable bus mastering before changing FB_LOCATION
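The attachment itself is not reproduced in this thread. As a rough illustration of the ordering described in the original report (stop DMA, change the registers, wait for idle, then allow DMA again), here is a toy Python model. It is not the patch's code: the FakeRV100 class and its methods are hypothetical stand-ins, and only the two register offsets are taken from the report.

```python
# Register offsets as given in the report.
MC_FB_LOCATION = 0x0148
DISPLAY_BASE_ADDR = 0x023C


class FakeRV100:
    """Toy stand-in for the chip; records the order of operations.

    Hypothetical: none of these methods correspond to real driver calls.
    """

    def __init__(self):
        self.regs = {MC_FB_LOCATION: 0, DISPLAY_BASE_ADDR: 0}
        self.bus_master = True
        self.log = []

    def set_bus_master(self, enabled):
        self.bus_master = enabled
        self.log.append("bus_master_on" if enabled else "bus_master_off")

    def write(self, reg, value):
        self.regs[reg] = value
        self.log.append(f"write {reg:04X}")

    def wait_for_idle(self):
        # A real driver would poll the hardware here; the model only logs.
        self.log.append("wait_idle")


def move_frame_buffer(chip, fb_location, display_base):
    """Change both registers with bus mastering held off throughout."""
    was_enabled = chip.bus_master
    chip.set_bus_master(False)           # stop DMA first
    chip.write(MC_FB_LOCATION, fb_location)
    chip.write(DISPLAY_BASE_ADDR, display_base)
    chip.wait_for_idle()                 # let the chip settle
    chip.set_bus_master(was_enabled)     # restore the earlier state


if __name__ == "__main__":
    chip = FakeRV100()
    move_frame_buffer(chip, 0xE3FFE000, 0xE0000000)
    print(chip.log)
```

The point of the sketch is only the sequencing: no register write happens while bus mastering is on, so no DMA transaction can be issued while the range and base values disagree.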
QE ACK for RHEL 4 only.
Errata updated, setting this bug to modified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-500.html
"xorg-x11-6.8.1-ati-radeon-RV100-bus-master-fix.patch" breaks "DRI" for my Radeon R100 QD PCI card. The server hangs when it tries to launch the login manager. See bug #180150 on this issue. The problem has been confirmed upstream by M. Dänzeer, and a new patch has been suggested for "Xorg" 6.9. I have ported it to release 6.8.2 and it solves the problem for me.