Bug 487126

Summary: r300: X livelock on resume when compiz is running
Product: [Fedora] Fedora Reporter: Roman Kagan <rkagan>
Component: mesa Assignee: Adam Jackson <ajax>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 10 CC: ajax, cra, xgl-maint
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-05-27 23:42:01 UTC Type: ---
Attachments:
Description Flags
Xorg.0.log none

Description Roman Kagan 2009-02-24 12:30:29 UTC
Created attachment 333038
Xorg.0.log

Description of problem:
h/w:
IBM ThinkPad T43p 2687-D5U
ATI Technologies Inc M24GL [Mobility FireGL V3200] (rev 80)

When running compiz, doing a suspend+resume results in a locked up X.

Version-Release number of selected component (if applicable):
The problem has been present throughout the F10 lifetime; the latest versions are:

mesa-dri-drivers-7.2-0.15.fc10.i386
libdrm-2.4.0-0.21.fc10.i386
xorg-x11-server-Xorg-1.5.3-6.fc10.i386
xorg-x11-drv-ati-6.10.0-2.fc10.i386
kernel-PAE-2.6.27.15-170.2.24.fc10.i686

How reproducible:
always

Steps to Reproduce:
1. Boot with nomodeset (IIRC the problem remains with KMS on)
2. Run a GNOME session with compiz (aka desktop effects on)
3. Suspend + resume
  
Actual results:
X locks up:
- screen is not updated
- no response to keyboard
The system stays alive; connecting via ssh shows that
- X eats 100% CPU
- Xorg.0.log (attached) reports a detected infinite loop
- strace reports an endless loop of signals + sigreturn()
- gdb never manages to attach to the running X


Expected results:
X continues to work from the point where it was suspended.

Additional info:
The system resumes fine with metacity (aka desktop effects off).

Comment 1 Roman Kagan 2009-02-24 12:36:32 UTC
Enabling debug logging from drm via

# echo 1 > /sys/module/drm/parameters/debug

after the resume showed an endless stream of messages in /proc/kmsg:

<7>[drm:drm_ioctl] pid=2502, cmd=0xc0086451, nr=0x51, dev 0xe200, auth=1
<7>[drm:radeon_cp_getparam] pid=2502
<7>[drm:drm_ioctl] pid=2502, cmd=0xc0086451, nr=0x51, dev 0xe200, auth=1
<7>[drm:radeon_cp_getparam] pid=2502
<7>[drm:drm_ioctl] pid=2502, cmd=0xc0086451, nr=0x51, dev 0xe200, auth=1
<7>[drm:radeon_cp_getparam] pid=2502
...
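
The cmd value in these messages can be decoded by hand using the usual Linux ioctl number layout on x86 (8 bits nr, 8 bits type, 14 bits size, 2 bits direction). The sketch below is only an illustration; the helper names are made up and are not part of libdrm or the kernel. Applied to 0xc0086451 it yields a read+write ioctl of type 'd' with an 8-byte argument and nr 0x51, which is consistent with the radeon_cp_getparam lines in the log.

```c
#include <stdint.h>

/*
 * Field layout of a Linux ioctl number on x86 (asm-generic/ioctl.h):
 *   bits  0-7   nr        (command number)
 *   bits  8-15  type      (driver "magic" character)
 *   bits 16-29  size      (size of the argument struct in bytes)
 *   bits 30-31  direction (1 = write, 2 = read, 3 = both)
 * The helper names below are hypothetical, chosen for this example.
 */
static inline unsigned ioc_nr(uint32_t cmd)   { return cmd & 0xffu; }
static inline unsigned ioc_type(uint32_t cmd) { return (cmd >> 8) & 0xffu; }
static inline unsigned ioc_size(uint32_t cmd) { return (cmd >> 16) & 0x3fffu; }
static inline unsigned ioc_dir(uint32_t cmd)  { return (cmd >> 30) & 0x3u; }
```

For the value in the log, ioc_nr(0xc0086451) is 0x51, which matches the nr=0x51 field that drm_ioctl prints next to it.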

Comment 2 Roman Kagan 2009-02-24 12:55:07 UTC
Relevant part of the X call trace from attachment 333038 above (addresses translated into source line numbers with eu-addr2line):

11: /usr/lib/libdrm.so.2 [0x4d4d6cf]
    libdrm-20080930/libdrm/xf86drm.c:187
        in drmIoctl()

12: /usr/lib/libdrm.so.2(drmCommandWriteRead+0x34) [0x4d4d934]
    libdrm-20080930/libdrm/xf86drm.c:2342
        in drmCommandWriteRead()

13: /usr/lib/dri/r300_dri.so [0x2b677a]
    mesa-20081001/src/mesa/drivers/dri/r300/radeon_ioctl.c:69
        in radeonGetLastFrame()

14: /usr/lib/dri/r300_dri.so [0x2b690f]
    mesa-20081001/src/mesa/drivers/dri/r300/radeon_ioctl.c:135
        in radeonWaitForFrameCompletion()

15: /usr/lib/dri/r300_dri.so(radeonCopyBuffer+0xd2) [0x2b6c79]
    mesa-20081001/src/mesa/drivers/dri/r300/radeon_ioctl.c:189
        in radeonCopyBuffer()

Comment 3 Roman Kagan 2009-02-24 13:10:27 UTC
The endless loop is in radeonWaitForFrameCompletion():

mesa-20081001/src/mesa/drivers/dri/r300/radeon_ioctl.c:135

...
                                while (radeonGetLastFrame(radeon) <
                                       sarea->last_frame) ;
...

Apparently the loop condition never becomes false.

A binary edit of /usr/lib/dri/r300_dri.so (I didn't have the tools to rebuild a patched version from source) that changes the jump address to leave the loop after the first iteration, equivalent to the following patch

--- a/src/mesa/drivers/dri/r300/radeon_ioctl.c
+++ b/src/mesa/drivers/dri/r300/radeon_ioctl.c
@@ -132,7 +132,7 @@
 	if (radeon->do_irqs) {
 		if (radeonGetLastFrame(radeon) < sarea->last_frame) {
 			if (!radeon->irqsEmitted) {
-				while (radeonGetLastFrame(radeon) <
+				if (radeonGetLastFrame(radeon) <
 				       sarea->last_frame) ;
 			} else {
 				UNLOCK_HARDWARE(radeon);

made it resume successfully.

I can't claim I understand the possible impact of the change; however, I have been running the patched version with compiz for 4 days now, and it has survived 14 suspend/resume cycles with no problems noticed.
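
A middle ground between the unbounded while and a single check would be a bounded poll. The following is a minimal sketch of that idea, not the fix that was adopted and not Mesa code: the callback type, the function names and the max_polls parameter are all hypothetical, and the callback merely stands in for radeonGetLastFrame().

```c
/* Hypothetical callback type standing in for radeonGetLastFrame(). */
typedef unsigned (*frame_reader_fn)(void *ctx);

/*
 * Poll the frame counter until it reaches target, but give up after
 * max_polls reads instead of spinning forever.
 * Returns 0 if the target was reached, -1 if the poll budget ran out.
 */
static inline int wait_for_frame_bounded(frame_reader_fn read_frame, void *ctx,
                                         unsigned target, unsigned max_polls)
{
    unsigned i;

    for (i = 0; i < max_polls; i++) {
        if (read_frame(ctx) >= target)
            return 0;
    }
    return -1;
}

/* Demo reader: a counter stuck at whatever value ctx points to. */
static inline unsigned stuck_reader(void *ctx)
{
    return *(unsigned *)ctx;
}

/* Demo reader: a counter that advances by one on every read. */
static inline unsigned advancing_reader(void *ctx)
{
    return (*(unsigned *)ctx)++;
}
```

With stuck_reader and a counter of 0, waiting for frame 5 returns -1 once the budget is used up; with advancing_reader it returns 0 after a few reads. What the caller should do on -1 is left open here.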

Comment 4 Roman Kagan 2009-02-26 15:05:30 UTC
I forgot to note that in Fedora 9 suspend/resume with compiz on this machine worked just fine; the regression showed up after upgrading to F10 in November.

Comment 5 Roman Kagan 2009-03-02 17:39:27 UTC
drm-radeon-fix-upstream-suspend.patch, included in the newer kernels, addresses this issue from the right angle: by resetting the relevant members of the sarea data structure on resume.
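
To illustrate the idea only: the sketch below uses a made-up struct and made-up function names, not the real sarea layout or the contents of the kernel patch. It also assumes, without confirmation from this report, that the hang comes from a stale pre-suspend last_frame value which the frame counter read back after resume never reaches.

```c
/*
 * Made-up stand-in for the shared area; only the field used by the
 * wait loop quoted in comment 3 is modelled.
 */
struct fake_sarea {
    unsigned last_frame;
};

/* Same comparison as the loop condition in radeonWaitForFrameCompletion(). */
static inline int must_wait(unsigned hw_last_frame,
                            const struct fake_sarea *sarea)
{
    return hw_last_frame < sarea->last_frame;
}

/*
 * Assumed resume hook: drop the stale pre-suspend value so the
 * comparison above can become false again. Returns its argument
 * for convenience.
 */
static inline struct fake_sarea *fake_sarea_reset_on_resume(struct fake_sarea *sarea)
{
    sarea->last_frame = 0;
    return sarea;
}
```

Under that assumption, a counter reading of 0 against a stale last_frame of 42 makes must_wait() true indefinitely, while after the reset it is false and the wait loop is never entered.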

Now suspend/resume with compiz works for me with unmodified mesa and the latest kernel from koji:

# rpmverify -f /usr/lib/dri/r300_dri.so 
# uname -r
2.6.27.19-170.2.38.fc10.i686.PAE
# grep -c -i resume /var/log/Xorg.0.log
8

Feel free to close the bug.