493068 – reproducible X hang with KMS on ATI RS690M

Summary: reproducible X hang with KMS on ATI RS690M

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-ati
Sub Component:
Version:	11
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Airlie
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	500503
Depends On:
Blocks:

Reported:	2009-03-31 14:35 UTC by Michal Schmidt
Modified:	2009-10-14 14:50 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-10-14 14:50:59 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg (40.83 KB, text/plain) 2009-03-31 14:39 UTC, Michal Schmidt
/var/log/Xorg.0.log (50.10 KB, text/plain) 2009-03-31 14:40 UTC, Michal Schmidt
/etc/X11/xorg.conf (826 bytes, text/plain) 2009-03-31 15:45 UTC, Michal Schmidt

Description Michal Schmidt 2009-03-31 14:35:06 UTC

Description of problem:
KMS is unusable on Rawhide on my laptop with:
01:05.0 VGA compatible controller [0300]: ATI Technologies Inc RS690M [Radeon X1200 Series] [1002:791f]
Within a few minutes after login Xorg hangs. The mouse pointer still moves. Other processes continue to live normally. I can ssh into the machine and see what Xorg is doing. It is sleeping with this stack trace:

# cat /proc/$(pidof Xorg)/stack
[<ffffffffa002228a>] drm_fence_object_wait+0x163/0x242 [drm]
[<ffffffffa0022de8>] drm_bo_wait+0x140/0x1ec [drm]
[<ffffffffa0017b1c>] drm_bo_vm_fault+0x5f/0x22f [drm]
[<ffffffff810c06b7>] __do_fault+0x55/0x38c
[<ffffffff810c2909>] handle_mm_fault+0x34d/0x7aa
[<ffffffff81398e66>] do_page_fault+0x5c7/0xa13
[<ffffffff81396955>] page_fault+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
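
When comparing several such dumps (for example, samples taken a few seconds apart over ssh), it can help to reduce each one to its bare symbol names. A minimal sketch, assuming the `[<address>] symbol+offset/size [module]` line format shown above; the `stack_symbols` helper name is made up for illustration:

```shell
# stack_symbols: hypothetical helper, not part of any package.
# Reads a /proc/<pid>/stack dump on stdin and prints one symbol name per
# line, dropping addresses, offsets and module tags. Lines without a
# "symbol+offset" part (such as the 0xffffffffffffffff terminator) are skipped.
stack_symbols() {
    sed -n 's/^\[<[0-9a-f]*>\] \([A-Za-z0-9_.]*\)+.*/\1/p'
}

# Example (as root, from the ssh session):
#   stack_symbols < /proc/$(pidof Xorg)/stack
```

Two samples reduced this way can then be compared with diff to see whether the process is parked in one place or moving between call sites.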

Version-Release number of selected component (if applicable):
kernel-2.6.29-21.fc11.x86_64
xorg-x11-server-Xorg-1.6.0-16.fc11.x86_64
xorg-x11-drv-ati-6.12.0-2.fc11.x86_64

How reproducible:
very easily, repeatedly

Steps to Reproduce:
1. use Firefox, maximize it, have a few tabs open, then close it and let it save the session
2. reboot with KMS enabled (without nomodeset)
3. log in to GNOME and run Firefox

Actual results:
Firefox starts maximized and tries to restore the previous session. Something it is drawing causes the hang.

Expected results:
X should not hang.

Comment 1 Michal Schmidt 2009-03-31 14:39:27 UTC

Created attachment 337314
dmesg

Comment 2 Michal Schmidt 2009-03-31 14:40:03 UTC

Created attachment 337315
/var/log/Xorg.0.log

Comment 3 Michal Schmidt 2009-03-31 15:45:49 UTC

Created attachment 337325
/etc/X11/xorg.conf

This is the /etc/X11/xorg.conf that goes together with the previously attached dmesg and Xorg.0.log. I usually do not have one, so I forgot to attach it then.
Nevertheless I have now moved xorg.conf away to let Xorg use full autodetection and the hang is still reproducible.

Comment 4 Michal Schmidt 2009-03-31 16:16:33 UTC

The hang is nicely reproducible, just sometimes the failure state looks different.
Sometimes instead of mostly sleeping, the Xorg process starts spinning in the kernel, spending 100% CPU time somewhere in radeon_gem_ib_get():

[root@leela ~]# cat /proc/$(pidof Xorg)/stack
[<ffffffff81046eca>] __cond_resched+0x32/0x5b
[<ffffffffa0021a18>] drm_fence_reference_unlocked+0x20/0x39 [drm]
[<ffffffffa0022dc5>] drm_bo_wait+0x11d/0x1ec [drm]
[<ffffffffa005d326>] radeon_gem_ib_get+0xd6/0x22f [radeon]
[<ffffffffa0060f50>] radeon_cs_ioctl+0x2e1/0x3b5 [radeon]
[<ffffffffa0012cee>] drm_ioctl+0x1ea/0x27f [drm]
[<ffffffff810f0448>] vfs_ioctl+0x6f/0x87
[<ffffffff810f08cb>] do_vfs_ioctl+0x46b/0x4ac
[<ffffffff810f0962>] sys_ioctl+0x56/0x79
[<ffffffff8101133a>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[root@leela ~]# cat /proc/$(pidof Xorg)/stack
[<ffffffff81046eca>] __cond_resched+0x32/0x5b
[<ffffffffa0022e01>] drm_bo_wait+0x159/0x1ec [drm]
[<ffffffffa005d326>] radeon_gem_ib_get+0xd6/0x22f [radeon]
[<ffffffffa0060f50>] radeon_cs_ioctl+0x2e1/0x3b5 [radeon]
[<ffffffffa0012cee>] drm_ioctl+0x1ea/0x27f [drm]
[<ffffffff810f0448>] vfs_ioctl+0x6f/0x87
[<ffffffff810f08cb>] do_vfs_ioctl+0x46b/0x4ac
[<ffffffff810f0962>] sys_ioctl+0x56/0x79
[<ffffffff8101133a>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

It can't be killed with SIGKILL when in this state.
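
From the ssh session, the State field of /proc/<pid>/status gives a quick hint about which of the two failure modes is in effect: a task spinning in the kernel normally reports R (running), while one blocked in a wait reports S or D. A minimal sketch, assuming the standard Linux `State:` line format; `proc_state` is a hypothetical helper name:

```shell
# proc_state: hypothetical helper. Reads the contents of /proc/<pid>/status
# on stdin and prints the one-letter state code (R, S, D, Z, ...).
proc_state() {
    awk '/^State:/ { print $2; exit }'
}

# Example:
#   proc_state < /proc/$(pidof Xorg)/status
```

Either way, a pending SIGKILL is only acted on once the task stops looping or waiting uninterruptibly inside the kernel, which would be consistent with the observation above.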

Comment 5 Michal Schmidt 2009-04-01 11:03:40 UTC

I ran firefox with --sync hoping it would help to see what it is doing when X hangs:

[root@leela ~]# cat /proc/$(pidof Xorg)/stack
[<ffffffffa0022292>] drm_fence_object_wait+0x163/0x242 [drm]
[<ffffffffa0022df0>] drm_bo_wait+0x140/0x1ec [drm]
[<ffffffffa0017b1c>] drm_bo_vm_fault+0x5f/0x22f [drm]
[<ffffffff810c077b>] __do_fault+0x55/0x38c
[<ffffffff810c29cd>] handle_mm_fault+0x34d/0x7aa
[<ffffffff813c64b6>] do_page_fault+0x5c7/0xa13
[<ffffffff813c3fa5>] page_fault+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

[root@leela ~]# pstack $(pidof Xorg)
#0  0x00000039e4e822e2 in memcpy () from /lib64/libc.so.6
#1  0x00007f491fd42f37 in fbBlt () from /usr/lib64/xorg/modules//libfb.so
#2  0x00007f491fd43310 in fbBltStip () from /usr/lib64/xorg/modules//libfb.so
#3  0x00007f491fd484e8 in fbGetImage () from /usr/lib64/xorg/modules//libfb.so
#4  0x00007f491fb288ac in exaGetImage ()
#5  0x00000000004d95bd in ?? ()
#6  0x000000000044798d in ProcGetImage ()
#7  0x0000000000447164 in Dispatch ()
#8  0x000000000042d095 in main ()

[root@leela ~]# pstack $(pidof firefox)
...
Thread 1 (Thread 0x7f93c2dea710 (LWP 4457)):
#0  0x00000039e4ed81b2 in select () from /lib64/libc.so.6
#1  0x00000039e7608746 in _xcb_conn_wait () from /usr/lib64/libxcb.so.1
#2  0x00000039e760a1dc in xcb_wait_for_reply () from /usr/lib64/libxcb.so.1
#3  0x00000039e6e4d073 in _XReply () from /usr/lib64/libX11.so.6
#4  0x00000039e6e29475 in XGetImage () from /usr/lib64/libX11.so.6
#5  0x000000346bc45e00 in _get_image_surface () from /usr/lib64/libcairo.so.2
#6  0x000000346bc468c9 in _cairo_xlib_surface_acquire_source_image ()
#7  0x000000346bc2c947 in _cairo_surface_acquire_source_image ()
#8  0x000000346bc2348c in _cairo_pattern_acquire_surface ()
#9  0x000000346bc24986 in _cairo_pattern_acquire_surfaces ()
#10 0x000000346bc177b8 in _cairo_image_surface_composite ()
#11 0x000000346bc2c7cd in _cairo_surface_composite ()
#12 0x000000346bc2ec6f in _clip_and_composite_trapezoids ()
#13 0x000000346bc2f39d in _cairo_surface_fallback_paint ()
#14 0x000000346bc2be9f in _cairo_surface_paint () from /usr/lib64/libcairo.so.2
#15 0x000000346bc1486a in _cairo_gstate_paint () from /usr/lib64/libcairo.so.2
#16 0x000000346bc0eb99 in cairo_paint () from /usr/lib64/libcairo.so.2
#17 0x0000003147165acc in ?? () from /usr/lib64/xulrunner-1.9.1/libxul.so
#18 0x0000003147165d11 in ?? () from /usr/lib64/xulrunner-1.9.1/libxul.so
#19 0x000000314717da64 in gfxGdkNativeRenderer::Draw(gfxContext*, int, int, unsigned int, gfxGdkNativeRenderer::DrawOutput*) ()
#20 0x0000003147063752 in ?? () from /usr/lib64/xulrunner-1.9.1/libxul.so
...

Comment 6 Michal Schmidt 2009-04-02 11:34:25 UTC

It is reproducible with
kernel-2.6.29.1-37.rc1.fc11.x86_64
xorg-x11-drv-ati-6.12.1-2.fc11.x86_64

(Changing to component xorg-x11-drv-ati, because:
Apr 01 23:55:13 <fcami> airlied: I suppose we still agree about switching KMS bugs component from kernel to xorg-x11-drv-ati ?
Apr 01 23:55:22 <airlied>       fcami: yes we should do that for now alright.
)

Comment 7 Michal Schmidt 2009-04-11 23:21:14 UTC

Still reproducible with
kernel-2.6.29.1-68.fc11.x86_64
xorg-x11-drv-ati-6.12.1-10.fc11.x86_64

Comment 8 Michal Schmidt 2009-04-15 14:59:18 UTC

Still the same with
kernel-2.6.29.1-70.fc11.x86_64
xorg-x11-drv-ati-6.12.2-2.fc11.x86_64

Comment 9 Hin-Tak Leung 2009-04-16 21:10:00 UTC

I feel the problem is somewhere between the userland DRM (mesa-dri-drivers) and the kernel/hardware. The issue also happens when using the radeonhd driver. Can somebody from mesa-dri take a look?

Comment 10 Pete Zaitcev 2009-04-17 18:02:51 UTC

I'm more concerned that switching DRI off in xorg.conf fails to disable
DRM and the box hangs anyway. Same stack trace that Michal posted.

Since it's my primary workstation, I don't have an option of running
in text mode, so I was limping along by hand-rolling kernels with DRM
disabled.

Maybe it's a ploy by Dave to force peeved kernel programmers to join
the effort :-)  First they fix the lockups, next you know they hack
away on the code...

Comment 11 Michal Schmidt 2009-04-17 18:29:07 UTC

For now I just run with 'nomodeset' and with Option "AccelMethod" "xaa". This looks like the only combination that is usable for me. It avoids this bug, bug 472505 and bug 463023. Subjectively it even seems a bit faster in my usual 2D usage. GLX does not work, but I don't need it.
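
Whether a given boot actually has KMS disabled can be checked from the kernel command line. A minimal sketch; `has_nomodeset` is a hypothetical helper name, and it only looks for the `nomodeset` word, not at what the driver ended up doing:

```shell
# has_nomodeset: hypothetical helper. Succeeds (exit status 0) if the given
# kernel command line contains "nomodeset" as a separate word.
has_nomodeset() {
    case " $1 " in
        *" nomodeset "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   if has_nomodeset "$(cat /proc/cmdline)"; then echo "KMS disabled"; fi
```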

Comment 12 Hin-Tak Leung 2009-04-17 19:39:21 UTC

(In reply to comment #10)

> Maybe it's a ploy by Dave to force peeved kernel programmers to join
> the effort :-)  First they fix the lockups, next you know they hack
> away on the code...  

:-)

(not a kernel programmer) I did start poking around the kernel...
'echo 1 > /sys/module/drm/parameters/debug' switches on DRM debugging at runtime. I ran it from an ssh session on a different machine, *after* X hung, and from then on the kernel repeatedly dumps about 5 lines to dmesg. The messages make no sense to me, but I attached them to one of the other radeon/radeonhd bug reports, hoping somebody else can read that gibberish. They won't tell you why the hang starts, but they will probably tell you why it *continues*, if you understand what those 5 terse lines say.

I think a combination of some extra debug code in the kernel DRM and 'echo 1' from an ssh session when the hang happens could help to solve the issue... (The userland code is traceable with 'setenforce 0', then attaching gdb to the X server and running bt, again remotely over ssh. Forgive me if I got this tip from you; it was from an @redhat guy...)
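
The runtime switch described in this comment can be wrapped so that the debug output is turned on only for a short capture window and then off again. A minimal sketch, assuming root access over ssh; `set_drm_debug` is a hypothetical helper, and its optional second argument exists only so the function can be exercised against an ordinary file:

```shell
# set_drm_debug: hypothetical helper. Writes a debug level to the drm
# module's "debug" parameter (path taken from the comment above).
# $1 = level (1 enables, 0 disables), $2 = optional alternative path.
set_drm_debug() {
    level=$1
    param=${2:-/sys/module/drm/parameters/debug}
    printf '%s\n' "$level" > "$param"
}

# Example capture after X has hung (run as root from another machine):
#   set_drm_debug 1; sleep 5; set_drm_debug 0
#   dmesg | tail -n 50
# Userland backtrace of the X server, as suggested in the comment:
#   gdb -p "$(pidof Xorg)" -batch -ex bt
```

Note that attaching gdb may itself block if the X server is stuck inside the kernel at that moment, so the dmesg capture is probably the safer first step.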

So a lot of time and a lot of patience can probably find the cause of the hang, but I have neither, nor any familiarity with the code involved (and I hope not to become familiar with it...).

Comment 13 Michal Schmidt 2009-04-23 12:18:55 UTC

xorg-x11-drv-ati-6.12.2-5.fc11.x86_64 looks good! I tried my reproducer several times and it survived.

Comment 14 Pete Zaitcev 2009-04-23 21:04:42 UTC

I had the same stack dump as Michal and X seems to work with the
6.12.2-5 from Koji. I think we can close this.

Comment 15 Hin-Tak Leung 2009-04-23 23:56:10 UTC

Yes, quite exciting. I have been haunted by the hang for a while, and the only workaround seemed to be both avoiding xpdf and switching to radeonhd. So I ran xpdf just before I switched to ati, and within 5 minutes of resizing and playing with bookmarks in xpdf, it locked up... Then I rebooted without an xorg.conf, played with xpdf for ages, and it stayed up.

So thanks a lot. I would just like the flickering to go away now... It doesn't flicker much under radeonhd, but it flickers very often with ati+EXA. :-)

Now I am going to have a look at that patch and see what it does...

Comment 16 Hin-Tak Leung 2009-04-24 00:03:19 UTC

Oh, btw, it seems to be kernel-sensitive: I tried booting 2.6.29.2-45.rc1.fc10
and it oopsed when starting the X server. (I still have a mostly F10 system, with F11 kernels, mesa*, ati and radeon, and xserver 1.6 rebuilt from the F11 RPMs, so when I see an F10 kernel on Koji I still want it. Hardware-wise, I am mostly on F11 already, with the kernel, mesa*, xserver 1.6, ati and radeon.)

The kernel oops is probably somewhat self-inflicted, but mixing and matching kernel and X server shouldn't really cause a kernel oops, should it?

Comment 17 Hin-Tak Leung 2009-04-24 02:27:47 UTC

I extracted the "radeon-6.12.2-fix-rs690-clamp.patch" from the koji src rpm, and I think there is a typo/bug there? As far as I see it is adding R300_TX_CLAMP_R() wherever R300_TX_CLAMP_S/R300_TX_CLAMP_T happens, but one chunk doesn't make sense:

 	case RepeatReflect:
 	    txfilter |= R300_TX_CLAMP_S(R300_TX_CLAMP_MIRROR) |
+		        R300_TX_CLAMP_T(R300_TX_CLAMP_MIRROR) |
 		        R300_TX_CLAMP_T(R300_TX_CLAMP_MIRROR);
 	    break;

This additional line has no effect, since "a|a" is "a", but it would make sense if it were a typo, meant to be _R instead of _T there.

Also the first case, case "RepeatNormal", don't we also need a 
                txfilter |= R300_TX_CLAMP_R(R300_TX_CLAMP_WRAP);
somewhere, in addition to
                txfilter |= R300_TX_CLAMP_R(R300_TX_CLAMP_CLAMP_GL);
?

Sorry for the questions.

Comment 18 Hin-Tak Leung 2009-04-24 02:34:23 UTC

I see there is no need for "|= R300_TX_CLAMP_R(R300_TX_CLAMP_WRAP)", since it is zero. What about the typo/no-op, then?
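
Both observations (the doubled _T term being a no-op, and OR-ing in a zero-valued WRAP being unnecessary) can be checked with plain integer arithmetic. A minimal sketch: the shift amounts and the MIRROR value below are assumptions chosen for illustration, not copied from the driver's register header; only WRAP being zero comes from this comment.

```shell
# Illustrative stand-ins for the driver macros. The shifts (0, 3, 6) and
# MIRROR=1 are assumed values; WRAP=0 is stated in the thread.
tx_clamp_s() { echo $(( $1 << 0 )); }
tx_clamp_t() { echo $(( $1 << 3 )); }
tx_clamp_r() { echo $(( $1 << 6 )); }
TX_CLAMP_WRAP=0
TX_CLAMP_MIRROR=1

s=$(tx_clamp_s "$TX_CLAMP_MIRROR")
t=$(tx_clamp_t "$TX_CLAMP_MIRROR")
r=$(tx_clamp_r "$TX_CLAMP_MIRROR")

before=$(( s | t ))           # RepeatReflect before the patch
as_patched=$(( s | t | t ))   # hunk as quoted: _T appears twice
with_r=$(( s | t | r ))       # what the hunk would set if _R were meant
wrap_r=$(( before | $(tx_clamp_r "$TX_CLAMP_WRAP") ))  # OR-ing in WRAP on R

echo "before=$before as_patched=$as_patched with_r=$with_r wrap_r=$wrap_r"
```

With these assumed values the script prints before=9 as_patched=9 with_r=73 wrap_r=9: the quoted hunk leaves the register value unchanged, an _R term would change it, and OR-ing in a zero-valued WRAP changes nothing.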

Comment 19 Dave Airlie 2009-04-24 12:05:09 UTC

there is another fix that fixes the fix in 6.12.2-6.

hopefully it doesn't undo any of the good.

I should push it back into the F-10 package as well at some point.

Comment 20 Michal Schmidt 2009-04-24 13:49:34 UTC

I'm running 6.12.2-6.fc11 now and it works fine also. Thank you Dave.

Comment 21 Hin-Tak Leung 2009-04-24 19:03:14 UTC

xvideo playback is broken...

Comment 22 Hin-Tak Leung 2009-04-24 19:34:39 UTC

The 6.12.2-5 -> 6.12.2-6 changes seem to consist of just the CLAMP_BORDER change, and that's what broke xvideo... (I went back to 6.12.2-5 and got xvideo back. The change is also entirely in the textured video code, which makes sense.) Also, what about the typo/no-op mentioned in comment 17?

Comment 23 Michal Schmidt 2009-04-26 22:38:59 UTC

I opened bug 497755 for the broken xvideo.

Comment 24 Michal Schmidt 2009-05-05 13:41:13 UTC

Reopening, since the fix for RS690 was reverted in 6.12.2-7 and the hang is reproducible in the latest Koji build 6.12.2-11.

Comment 25 Michal Schmidt 2009-05-06 08:13:55 UTC

The workaround in -12 seems to be effective. No hangs. Please push to dist-f11.
xorg-x11-drv-ati-6.12.2-12.fc11.x86_64
xorg-x11-server-Xorg-1.6.1-11.fc11.x86_64

Comment 26 Hin-Tak Leung 2009-05-06 08:41:00 UTC

PNG is still busted with -12; see the comment in bug 497427

Comment 27 Pete Zaitcev 2009-05-12 22:54:29 UTC

*** Bug 500503 has been marked as a duplicate of this bug. ***

Comment 28 Pete Zaitcev 2009-05-12 22:57:01 UTC

PNG was fine for me with -9 and -13, but in any case let's not sidetrack,
and let David concentrate on the hang. Note that it _appears_ that -9
was ok hang-wise for me, but -13 hung. It took quite a while, though,
so it may be a chance event.

Comment 29 Hin-Tak Leung 2009-05-13 00:58:43 UTC

It is a bit strange that the 3D texture clamp has any effect on 2D acceleration. Maybe one needs to look into which application (mesa?) is generating 3D textures...

Comment 30 Pete Zaitcev 2009-05-21 18:07:05 UTC

Still hangs with:

kernel-2.6.29.3-155.fc11.x86_64
xorg-x11-server-Xorg-1.6.1-11.fc11.x86_64
xorg-x11-drv-ati-6.12.2-14.fc11.x86_64

Stack:
[<ffffffff81042d14>] __cond_resched+0x32/0x5b
[<ffffffffa00ed78c>] drm_fence_reference_unlocked+0x1e/0x39 [drm]
[<ffffffffa00eea6b>] drm_bo_wait+0xf5/0x1c2 [drm]
[<ffffffffa012a214>] radeon_gem_ib_get+0xd4/0x22d [radeon]
[<ffffffffa012da8d>] radeon_cs_ioctl+0x2c2/0x3ad [radeon]
[<ffffffffa00deb7e>] drm_ioctl+0x20e/0x2c5 [drm]
[<ffffffff810e0e94>] vfs_ioctl+0x6f/0x87
[<ffffffff810e132f>] do_vfs_ioctl+0x462/0x4a3
[<ffffffff810e13c6>] sys_ioctl+0x56/0x79
[<ffffffff8101133a>] system_call_fastpath+0x16/0x1b

David, should I clone this bug away from Michal's? It may be a different
root cause.

Comment 31 Hin-Tak Leung 2009-05-23 11:27:01 UTC

I haven't had a hang since I started doing my own patch:

https://bugzilla.redhat.com/show_bug.cgi?id=497427#c20

FWIW, effective TX_CLAMP patches were only in between -6 and -11 - then David Airlie went for something different in -12 onwards.

I think David is on holiday... would like to see him comment on that patch.

Comment 32 Bug Zapper 2009-06-09 12:51:14 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 33 Jérôme Glisse 2009-10-14 10:32:40 UTC

Does it work any better with Fedora 12 (the latest test live CD, for instance)? Here with Fedora 12 I don't seem to have any issue.

Comment 34 Michal Schmidt 2009-10-14 12:54:32 UTC

Fedora 12 works for me, no hangs.

Comment 35 Jérôme Glisse 2009-10-14 14:50:59 UTC

Closing.
