Red Hat Bugzilla – Bug 625187
[NVa5] F14: NVA3/NVA5/NVA8 issues (random hang, no 3d, no Xv)
Last modified: 2012-03-28 04:00:17 EDT
[ 145.403] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
[ 145.404] 0: /usr/bin/Xorg (xorg_backtrace+0x28) [0x49ffa8]
[ 145.404] 1: /usr/bin/Xorg (mieqEnqueue+0x1f4) [0x49f4d4]
[ 145.404] 2: /usr/bin/Xorg (xf86PostMotionEventP+0xc4) [0x47c274]
[ 145.404] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fecfdebd000+0x4265) [0x7fecfdec1265]
[ 145.405] 4: /usr/bin/Xorg (0x400000+0x6a157) [0x46a157]
[ 145.405] 5: /usr/bin/Xorg (0x400000+0x1185f3) [0x5185f3]
[ 145.405] 6: /lib64/libc.so.6 (0x328ee00000+0x33f80) [0x328ee33f80]
[ 145.405] 7: /lib64/libc.so.6 (ioctl+0x7) [0x328eed98e7]
[ 145.405] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x329c603388]
[ 145.405] 9: /usr/lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x329c60360b]
[ 145.405] 10: /usr/lib64/libdrm_nouveau.so.1 (0x7fecff9b0000+0x2dfd) [0x7fecff9b2dfd]
[ 145.405] 11: /usr/lib64/libdrm_nouveau.so.1 (nouveau_bo_map_range+0xfe) [0x7fecff9b2fee]
[ 145.406] 12: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (0x7fecffbb5000+0x5cd2) [0x7fecffbbacd2]
[ 145.406] 13: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x5f97) [0x7fecfef6bf97]
[ 145.406] 14: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x8bc2) [0x7fecfef6ebc2]
[ 145.406] 15: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x1409d) [0x7fecfef7a09d]
[ 145.406] 16: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0xff9c) [0x7fecfef75f9c]
[ 145.406] 17: /usr/bin/Xorg (0x400000+0xd5f73) [0x4d5f73]
[ 145.406] 18: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x11348) [0x7fecfef77348]
[ 145.406] 19: /usr/bin/Xorg (0x400000+0xd12a1) [0x4d12a1]
[ 145.407] 20: /usr/bin/Xorg (0x400000+0x2d431) [0x42d431]
[ 145.407] 21: /usr/bin/Xorg (0x400000+0x2148e) [0x42148e]
[ 145.407] 22: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x328ee1ecfd]
[ 145.407] 23: /usr/bin/Xorg (0x400000+0x21039) [0x421039]
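One way to read a backtrace like the one above is to reduce it to the chain of modules involved. The following is a minimal Python sketch, not part of the original report: `parse_frames` and `module_chain` are hypothetical helper names, and the regular expression assumes the frame layout shown in this log.

```python
import os
import re

# Assumed frame layout, taken from the log above, e.g.:
#   [ 145.405] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x329c603388]
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+(\d+):\s+(\S+)\s+\(([^)]+)\)\s+\[(0x[0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Return (index, module path, location) for each backtrace frame found."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            index, module, location, _address = match.groups()
            frames.append((int(index), module, location))
    return frames


def module_chain(frames):
    """Collapse consecutive frames from the same module into a single entry."""
    chain = []
    for _index, module, _location in frames:
        name = os.path.basename(module)
        if not chain or chain[-1] != name:
            chain.append(name)
    return chain


# A few frames copied from the backtrace above.
SAMPLE = """\
[ 145.404] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fecfdebd000+0x4265) [0x7fecfdec1265]
[ 145.405] 7: /lib64/libc.so.6 (ioctl+0x7) [0x328eed98e7]
[ 145.405] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x329c603388]
[ 145.405] 9: /usr/lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x329c60360b]
[ 145.405] 11: /usr/lib64/libdrm_nouveau.so.1 (nouveau_bo_map_range+0xfe) [0x7fecff9b2fee]
"""

if __name__ == "__main__":
    print(" -> ".join(module_chain(parse_frames(SAMPLE))))
```

Run against the full trace, the chain suggests the server was waiting in an ioctl issued through libdrm for nouveau (`nouveau_bo_map_range`) while evdev kept queueing motion events, which would be consistent with the "EQ overflowing" message.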
Can I see your *full* Xorg.0.log, as well as dmesg output from after a hang?
Created attachment 439559 [details]
Xorg log after a freeze
Created attachment 439560 [details]
dmesg after a freeze
Created attachment 439561 [details]
This is a duplicate of 596330; however, I'll keep this bug open to track it. I'll push an update into F14 to disable acceleration on these chips, just as we do for F13.
*** Bug 629766 has been marked as a duplicate of this bug. ***
ok, I'm booting with nouveau.noaccel=1 now. oddly, this seems to result in my panel notification area being reliably corrupted. i'll attach a screenshot.
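For anyone wanting to confirm that the option actually reached the kernel, here is a minimal Python sketch. It assumes the parameter was passed on the kernel command line and that `/proc/cmdline` is readable; `module_param` and `read_cmdline` are hypothetical helper names. It only reports what was passed at boot, not whether the driver disables acceleration on its own for a given chip.

```python
def module_param(cmdline, name):
    """Return the value of `name=value` on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            # Assumption: if the option is repeated, the last occurrence is used.
            value = val
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    print("nouveau.noaccel =", module_param(read_cmdline(), "nouveau.noaccel"))
```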
Created attachment 446374 [details]
notification area corruption running with nouveau.noaccel=1
I got lots of similar corruption - small bits of windows being tiled over and over again instead of what should be there - while editing the screenshot with gimp. this doesn't happen with acceleration enabled.
Hmm, that sounds like an unrelated bug. Perhaps in pixman or the X server? Nouveau doesn't do any drawing at all with noaccel enabled.
I haven't had any random hangs with nouveau.noaccel, so this is 'fixed' as long as the update's been pushed. Has it? I lose track of what kernel version you put it into.
(Any luck tracking down the actual cause of this so we can turn acceleration back on?)
I too have not seen the lockups. However, noaccel is really just a workaround. Is anyone working on debugging the cause of these lockups?
I look at the issue every time I think of something else to look at, but as of yet neither I nor another nouveau developer who's been looking at the issue has any idea why the card's even hanging.
How sad, considering that, apart from the lockups, it actually works pretty well...
yeah. btw, ben, do adapters which are blacklisted to use noaccel broadcast it with some obvious key phrase in the logs, so we can catch why acceleration is disabled when mystified people complain about it in the forums? :)
* Thu Sep 30 2010 Ben Skeggs <email@example.com> 126.96.36.199-36
- nouveau: fix theoretical race condition which may be the cause of some random hangs people reported.
I'm guessing this is related. How can we test this? What can I do to help debug this?
(In reply to comment #16)
> * Thu Sep 30 2010 Ben Skeggs <firstname.lastname@example.org> 188.8.131.52-36
> - nouveau: fix theoretical race condition which may be the cause of some
> random hangs people reported.
> I'm guessing this is related. How can we test this? What can I do to help
> debug this?
It's not related.
(In reply to comment #15)
> yeah. btw, ben, do adapters which are blacklisted to use noaccel broadcast it
> with some obvious key phrase in the logs, so we can catch why acceleration is
> disabled when mystified people complain about it in the forums? :)
I can probably add something to xorg-x11-drv-nouveau to give a nicer message than "Failed to open GPU channel" I guess?
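Until a nicer message exists, the string quoted above is the thing to search for. A minimal Python sketch, assuming the log lives at `/var/log/Xorg.0.log` (a common default, but not stated in this report); `find_message` is a hypothetical helper name.

```python
import sys

MESSAGE = "Failed to open GPU channel"  # string quoted in the comment above


def find_message(log_text, message=MESSAGE):
    """Return (line number, line) pairs for log lines containing the message."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if message in line
    ]


if __name__ == "__main__":
    # Assumed default location of the log; pass another path as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/Xorg.0.log"
    with open(path, errors="replace") as handle:
        for number, line in find_message(handle.read()):
            print(f"{path}:{number}: {line}")
```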
yeah, that would probably help.
Fedora Bugzappers volunteer triage team
*** Bug 652224 has been marked as a duplicate of this bug. ***
*** Bug 661928 has been marked as a duplicate of this bug. ***
As a non-driver-programmer, is there any info or testing that I can provide that would be helpful?
(In reply to comment #22)
> As a non-driver-programmer, is there any info or testing that I can provide
> that would be helpful?
What he said.
Is there a way to force acceleration on so I can try to gather dumps, backtraces, etc?
Also, is there an upstream (X.org) bug for this?
I've raised the priority of this ticket. In rawhide booting into GNOME now results in a completely usable desktop when nouveau.noaccel=1, but nouveau.noaccel=0 results in frequent lockups. I'm willing to do whatever I can do to help solve it (dumps, etc). Please advise.
This ticket is not for Rawhide. Please don't confuse this report with issues that relate only to Rawhide.
The issue in Rawhide isn't really a driver issue; it's that GNOME 3 is not yet smart enough to fall back from GNOME Shell to gnome-panel if a driver which should support the necessary OpenGL/compositing stuff is installed, but acceleration is disabled. So reporting it in this bug will only confuse matters. There's nothing inherently wrong with the nouveau driver with noaccel, still.
Please don't change the priority field. Per Fedora policy, it is restricted to the package maintainer.
I realize this ticket isn't for Rawhide, but the same problem exists in rawhide. Should I file a separate ticket?
Why was GNOME Shell made the default if it doesn't have the smarts to fall back to something sensible? Won't this just feed the trolls who insist that shell isn't ready?
Lastly, I think my request for a link to an upstream bug is sensible. Does upstream even know about this issue?
"Why was GNOME Shell made the default if it doesn't have the smarts to fall back
to something sensible? Won't this just feed the trolls who insist that shell
isn't ready?"
Of course it's not ready. We haven't even released F15 Alpha yet. Working on the fallback mechanism is one of the main points of effort for the desktop team ATM.
Theoretically you should file another ticket for the fallback issue, but actually the GNOME developers are already aware of it. If there's no actual report yet, though, it's probably worth filing one to make sure it can be tracked.
As discussed earlier on #nouveau, the bug appears to be solved for me by using a 2.6.38 kernel and an out-of-tree build of nouveau. I have not had any lockups in almost 5 days, with compiz running. The "fixing" (knock on wood) patch is this one: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=25c68aef4e6abcc3c10f593fc565c342ebe2ded8 .
Chances are that the fix is a combination of this patch and earlier ones. What are the odds of finding a kernel in Fedora 15 with Nouveau worked up to this patch?
f15 should usually have pretty much bang up-to-date nouveau. have you tried an f15 live image to see how it runs?
Well, no, I have not tried this live image. The patch I linked to was written and pushed to the nouveau code base yesterday, so it's unlikely to be included in a Fedora 15 live image as of today. If it weren't for the GCC 4.6 rebuilds I would grab a fresh Fedora 15 kernel for testing purposes, but I don't dare use that in a GCC 4.5 environment.
The kernel I am currently running, and have been running for the past month, is http://koji.fedoraproject.org/koji/buildinfo?buildID=217874 , as stated, with an out-of-tree nouveau build from freedesktop.org. Using the module provided with this kernel, or the module I compiled myself before this patch was out (a few weeks ago), the system would lock up randomly every 0.5 to 4 hours, mostly depending on the usage. With this patch, I have not seen it lock up in 5 days.
ah, I didn't check the date. I expect it'll get downstreamed into Fedora soon, then. Ben?
Damn, damn, damn. It locked up.
[26807.636214] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00001068 0x00000000
[26809.638360] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00001068 0x00000000
And according to rnn, these regs have not been reverse-engineered yet. Well, at least there's an error message now, so there is a lead. And besides, it's way more stable, so I think it's a win nonetheless.
Roy, a 10x improvement is nothing to snicker at! :)
The patch does apply cleanly (with a small offset) to the latest F15 kernel and seems to be relatively self-contained. I've kicked off a scratch build of the latest F15 kernel + nouveau patch for F14 here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2957862
Please test it. :) Hopefully this will be essentially a drop-in replacement and we can backport it to existing kernels.
I doubt just dropping that patch into F14 *will* improve things at all. I did try this on an nva3 I had for a while, and it improved nothing. It's likely the VM overhaul in later nouveau versions is *also* required to help with this bug. That said, the patch will go into f15 anyway at some point.
For the record, I've been running this patch for at least 24 hours with no freezes. It is a *massive* improvement (at least an order of magnitude).
Apr 1 17:01:35 Torres kernel: [ 1124.477920] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x01b00003 0x00000000 0x00001068 0x00200000
Apr 3 22:25:37 Torres kernel: [33822.122010] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00005068 0x00000000
There are patterns in the error messages on lockup, even though they are not exactly identical.
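To make the pattern easier to see, the four values printed after "fail:" can be compared bit by bit. This is a minimal Python sketch, not something from the nouveau tools: `extract_registers` and `varying_bits` are hypothetical helper names, and the meaning of the values is unknown, since the registers have not been reverse-engineered.

```python
import re

MARKER = "PGRAPH TLB flush idle timeout fail:"


def extract_registers(log_text):
    """Return one list of integers per matching log line (values after MARKER)."""
    rows = []
    for line in log_text.splitlines():
        _before, found, after = line.partition(MARKER)
        if found:
            rows.append([int(v, 16) for v in re.findall(r"0x[0-9a-fA-F]+", after)])
    return rows


def varying_bits(rows):
    """For each column, OR together the bits that differ from the first row."""
    first = rows[0]
    masks = [0] * len(first)
    for row in rows[1:]:
        for column, (base, value) in enumerate(zip(first, row)):
            masks[column] |= base ^ value
    return masks


# Three of the messages quoted in this report.
SAMPLE = """\
[26807.636214] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00001068 0x00000000
Apr 1 17:01:35 Torres kernel: [ 1124.477920] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x01b00003 0x00000000 0x00001068 0x00200000
Apr 3 22:25:37 Torres kernel: [33822.122010] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00005068 0x00000000
"""

if __name__ == "__main__":
    for column, mask in enumerate(varying_bits(extract_registers(SAMPLE))):
        print(f"value {column}: differing bits 0x{mask:08x}")
```

On these three messages the second value is always zero, and the first, third and fourth values each differ in a single bit.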
This issue appears to be fixed, at least for me, with Fedora 15.
Yes, it probably should be okay in F15. It's stupidly hard to say. We don't really know what fixed it; it was likely a combination of a lot of different changes in nouveau.
Anyway, this won't be getting fixed in F14. Closing :)
Sorry, but I use F16 and am still getting system freezes with a system message about nouveau :-(
It appears only in GNOME 3 (gnome-shell).
Working with LXDE is OK.