Bug 625187 - [NVa5] F14: NVA3/NVA5/NVA8 issues (random hang, no 3d, no Xv)
Status: CLOSED NEXTRELEASE
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-nouveau
Version: 14
Hardware: All Linux
Priority: low
Severity: medium
Assigned To: Ben Skeggs
QA Contact: Fedora Extras Quality Assurance
Keywords: Patch, Triaged, Upstream
Duplicates: 629766 652224 661928
Depends On:
Blocks:
Reported: 2010-08-18 15:46 EDT by Nathaniel McCallum
Modified: 2012-03-28 04:00 EDT
CC List: 15 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Last Closed: 2011-08-09 18:23:52 EDT

Attachments
Xorg log after a freeze (68.72 KB, text/plain), 2010-08-18 21:21 EDT, Nathaniel McCallum
dmesg after a freeze (57.30 KB, text/plain), 2010-08-18 21:26 EDT, Nathaniel McCallum
lspci (43.60 KB, text/plain), 2010-08-18 21:27 EDT, Nathaniel McCallum
notification area corruption running with nouveau.noaccel=1 (7.79 KB, image/png), 2010-09-09 16:45 EDT, Adam Williamson


External Trackers
Tracker ID Priority Status Summary Last Updated
FreeDesktop.org 26980 None None None Never

Description Nathaniel McCallum 2010-08-18 15:46:38 EDT
From Xorg.0.log:
[   145.403] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
[   145.404] 
Backtrace:
[   145.404] 0: /usr/bin/Xorg (xorg_backtrace+0x28) [0x49ffa8]
[   145.404] 1: /usr/bin/Xorg (mieqEnqueue+0x1f4) [0x49f4d4]
[   145.404] 2: /usr/bin/Xorg (xf86PostMotionEventP+0xc4) [0x47c274]
[   145.404] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fecfdebd000+0x4265) [0x7fecfdec1265]
[   145.405] 4: /usr/bin/Xorg (0x400000+0x6a157) [0x46a157]
[   145.405] 5: /usr/bin/Xorg (0x400000+0x1185f3) [0x5185f3]
[   145.405] 6: /lib64/libc.so.6 (0x328ee00000+0x33f80) [0x328ee33f80]
[   145.405] 7: /lib64/libc.so.6 (ioctl+0x7) [0x328eed98e7]
[   145.405] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x329c603388]
[   145.405] 9: /usr/lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x329c60360b]
[   145.405] 10: /usr/lib64/libdrm_nouveau.so.1 (0x7fecff9b0000+0x2dfd) [0x7fecff9b2dfd]
[   145.405] 11: /usr/lib64/libdrm_nouveau.so.1 (nouveau_bo_map_range+0xfe) [0x7fecff9b2fee]
[   145.406] 12: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (0x7fecffbb5000+0x5cd2) [0x7fecffbbacd2]
[   145.406] 13: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x5f97) [0x7fecfef6bf97]
[   145.406] 14: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x8bc2) [0x7fecfef6ebc2]
[   145.406] 15: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x1409d) [0x7fecfef7a09d]
[   145.406] 16: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0xff9c) [0x7fecfef75f9c]
[   145.406] 17: /usr/bin/Xorg (0x400000+0xd5f73) [0x4d5f73]
[   145.406] 18: /usr/lib64/xorg/modules/libexa.so (0x7fecfef66000+0x11348) [0x7fecfef77348]
[   145.406] 19: /usr/bin/Xorg (0x400000+0xd12a1) [0x4d12a1]
[   145.407] 20: /usr/bin/Xorg (0x400000+0x2d431) [0x42d431]
[   145.407] 21: /usr/bin/Xorg (0x400000+0x2148e) [0x42148e]
[   145.407] 22: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x328ee1ecfd]
[   145.407] 23: /usr/bin/Xorg (0x400000+0x21039) [0x421039]
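For triage, a backtrace like the one above can be reduced to the chain of modules it passes through. The following is a minimal Python sketch, assuming the frame layout seen in this log; the regular expression and the helper names `parse_frames` and `module_chain` are illustrative and not part of any Xorg tooling.

```python
import re

# Matches frames such as:
# [   145.405] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x329c603388]
FRAME_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+(?P<index>\d+):\s+(?P<module>\S+)\s+"
    r"\((?P<location>[^)]*)\)\s+\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"
)


def parse_frames(log_text):
    """Return (index, module, location, address) for every backtrace frame."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((
                int(match.group("index")),
                match.group("module"),
                match.group("location"),
                match.group("address"),
            ))
    return frames


def module_chain(frames):
    """Collapse consecutive frames from the same module, innermost frame first."""
    chain = []
    for _, module, _, _ in frames:
        name = module.rsplit("/", 1)[-1]
        if not chain or chain[-1] != name:
            chain.append(name)
    return chain


# A few frames copied from the log in this report.
SAMPLE = """\
[   145.404] 1: /usr/bin/Xorg (mieqEnqueue+0x1f4) [0x49f4d4]
[   145.405] 7: /lib64/libc.so.6 (ioctl+0x7) [0x328eed98e7]
[   145.405] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x329c603388]
[   145.405] 9: /usr/lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x329c60360b]
[   145.405] 11: /usr/lib64/libdrm_nouveau.so.1 (nouveau_bo_map_range+0xfe) [0x7fecff9b2fee]
"""

if __name__ == "__main__":
    for name in module_chain(parse_frames(SAMPLE)):
        print(name)
```

Run against the full log, this makes it easier to see that the event-queue overflow was reported while the server was inside an ioctl issued through libdrm on behalf of the nouveau driver.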
Comment 1 Ben Skeggs 2010-08-18 18:54:46 EDT
Can I see your *full* Xorg.0.log, as well as dmesg output from after a hang?

Thanks.
Comment 2 Nathaniel McCallum 2010-08-18 21:21:45 EDT
Created attachment 439559
Xorg log after a freeze
Comment 3 Nathaniel McCallum 2010-08-18 21:26:16 EDT
Created attachment 439560
dmesg after a freeze
Comment 4 Nathaniel McCallum 2010-08-18 21:27:34 EDT
Created attachment 439561
lspci
Comment 5 Ben Skeggs 2010-09-03 02:44:45 EDT
This is a duplicate of 596330; however, I'll keep this bug open to track it.  I'll push an update into F14 to disable acceleration on these chips, just as we do for F13.
Comment 6 Ben Skeggs 2010-09-03 02:52:42 EDT
*** Bug 629766 has been marked as a duplicate of this bug. ***
Comment 7 Adam Williamson 2010-09-09 16:43:46 EDT
ok, I'm booting with nouveau.noaccel=1 now. oddly, this seems to result in my panel notification area being reliably corrupted. i'll attach a screenshot.
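When comparing reports, it helps to confirm which setting a given boot actually used. The sketch below assumes the option was passed on the kernel command line as `nouveau.noaccel=<value>`, as in this thread. `noaccel_setting` is a hypothetical helper, and `/proc/cmdline` only reflects boot parameters, not options set elsewhere (for example in modprobe configuration).

```python
def noaccel_setting(cmdline):
    """Return the value of nouveau.noaccel from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "nouveau.noaccel" and sep:
            value = val  # in this sketch the last occurrence wins
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            print(noaccel_setting(handle.read()))
    except OSError:
        print("could not read /proc/cmdline")
```

A result of `None` only means the option was absent from the command line; it says nothing about whether the driver disabled acceleration on its own for a blacklisted chipset.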
Comment 8 Adam Williamson 2010-09-09 16:45:49 EDT
Created attachment 446374
notification area corruption running with nouveau.noaccel=1
Comment 9 Adam Williamson 2010-09-09 16:46:26 EDT
I got lots of similar corruption - small bits of windows being tiled over and over again instead of what should be there - while editing the screenshot with gimp. this doesn't happen with acceleration enabled.
Comment 10 Ben Skeggs 2010-09-09 18:33:09 EDT
Hmm, that sounds like an unrelated bug.  Perhaps in pixman or the X server?  Nouveau doesn't do any drawing at all with noaccel enabled.
Comment 11 Adam Williamson 2010-10-07 15:46:02 EDT
I haven't had any random hangs with nouveau.noaccel, so this is 'fixed' as long as the update's been pushed. Has it? I lose track of what kernel version you put it into.

(Any luck tracking down the actual cause of this so we can turn acceleration back on?)
Comment 12 Nathaniel McCallum 2010-10-07 15:53:42 EDT
I too have not seen the lockups.  However, noaccel is really just a workaround.  Is anyone working on debugging the cause of these lockups?
Comment 13 Ben Skeggs 2010-10-07 19:07:09 EDT
I look at the issue every time I think of something else to look at, but as of yet neither I nor another nouveau developer who's been looking at the issue has any idea why the card's even hanging.
Comment 14 Nathaniel McCallum 2010-10-07 22:02:38 EDT
How sad, considering that, apart from the lockups, it actually works pretty well...
Comment 15 Adam Williamson 2010-10-08 11:39:37 EDT
yeah. btw, ben, do adapters which are blacklisted to use noaccel broadcast it with some obvious key phrase in the logs, so we can catch why acceleration is disabled when mystified people complain about it in the forums? :)
Comment 16 Nathaniel McCallum 2010-10-08 13:22:24 EDT
* Thu Sep 30 2010 Ben Skeggs <bskeggs@redhat.com> 2.6.35.6-36
 - nouveau: fix theoretical race condition which may be the cause of some random hangs people reported.

I'm guessing this is related.  How can we test this?  What can I do to help debug this?
Comment 17 Ben Skeggs 2010-10-10 19:35:13 EDT
(In reply to comment #16)
> * Thu Sep 30 2010 Ben Skeggs <bskeggs@redhat.com> 2.6.35.6-36
>  - nouveau: fix theoretical race condition which may be the cause of some
> random hangs people reported.
> 
> I'm guessing this is related.  How can we test this?  What can I do to help
> debug this?

It's not related.

(In reply to comment #15)
> yeah. btw, ben, do adapters which are blacklisted to use noaccel broadcast it
> with some obvious key phrase in the logs, so we can catch why acceleration is
> disabled when mystified people complain about it in the forums? :)

I can probably add something to xorg-x11-drv-nouveau to give a nicer message than "Failed to open GPU channel" I guess?
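Until a clearer message is available, forum helpers could search a submitted Xorg log for the phrases quoted in this report. This is a rough Python sketch: the two phrases come from this thread, but the descriptions attached to them and the `scan_log` helper are assumptions for illustration, and the wording of any newer driver message is not given here, so it is not matched.

```python
# Phrases quoted in this bug report, mapped to a short triage hint.
# The hints are this sketch's own wording, not driver output.
MARKERS = {
    "Failed to open GPU channel": "acceleration unavailable (possibly disabled by noaccel)",
    "EQ overflowing": "X server event queue overflow (server may be stuck)",
}


def scan_log(log_text):
    """Return (line number, phrase, hint) for each marker found in the log."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for phrase, hint in MARKERS.items():
            if phrase in line:
                hits.append((lineno, phrase, hint))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py /var/log/Xorg.0.log
    with open(sys.argv[1], errors="replace") as handle:
        for lineno, phrase, hint in scan_log(handle.read()):
            print(f"line {lineno}: {phrase!r}: {hint}")
```

A plain substring match is used deliberately, since the exact log prefix around these phrases can differ between driver versions.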
Comment 18 Adam Williamson 2010-10-10 20:12:21 EDT
yeah, that would probably help.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 19 Ben Skeggs 2010-10-10 21:04:38 EDT
(In reply to comment #18)
> yeah, that would probably help.

Done: http://koji.fedoraproject.org/koji/buildinfo?buildID=199899
Comment 20 Ben Skeggs 2010-12-09 17:42:10 EST
*** Bug 652224 has been marked as a duplicate of this bug. ***
Comment 21 Ben Skeggs 2010-12-09 20:38:13 EST
*** Bug 661928 has been marked as a duplicate of this bug. ***
Comment 22 Matthew Truch 2010-12-10 22:37:58 EST
As a non-driver-programmer, is there any info or testing that I can provide that would be helpful?
Comment 23 Ian Pilcher 2011-01-04 15:19:24 EST
(In reply to comment #22)
> As a non-driver-programmer, is there any info or testing that I can provide
> that would be helpful?

What he said.

Is there a way to force acceleration on so I can try to gather dumps, backtraces, etc?

Also, is there an upstream (X.org) bug for this?
Comment 24 Nathaniel McCallum 2011-01-14 14:02:42 EST
I've raised the priority of this ticket.  In rawhide booting into GNOME now results in a completely usable desktop when nouveau.noaccel=1, but nouveau.noaccel=0 results in frequent lockups.  I'm willing to do whatever I can do to help solve it (dumps, etc).  Please advise.
Comment 25 Nathaniel McCallum 2011-01-14 14:03:05 EST
s/usable/unusable/
Comment 26 Adam Williamson 2011-01-14 14:14:16 EST
This ticket is not for Rawhide. Please don't confuse this report with issues that relate only to Rawhide.

The issue in Rawhide isn't really a driver issue; it's that GNOME 3 is not yet smart enough to fall back from GNOME Shell to gnome-panel if a driver which should support the necessary OpenGL/compositing stuff is installed, but acceleration is disabled. So reporting it in this bug will only confuse matters. There's nothing inherently wrong with the nouveau driver with noaccel, still.

Please don't change the priority field. Per Fedora policy, it is restricted to the package maintainer.



Comment 27 Nathaniel McCallum 2011-01-14 14:32:41 EST
I realize this ticket isn't for Rawhide, but the same problem exists in rawhide.  Should I file a separate ticket?

Why was GNOME Shell made the default if it doesn't have the smarts to fall back to something sensible?  Won't this just feed the trolls who insist that shell isn't ready?

Lastly, I think my request for a link to an upstream bug is sensible.  Does upstream even know about this issue?
Comment 28 Adam Williamson 2011-01-14 15:40:34 EST
"Why was GNOME Shell made the default if it doesn't have the smarts to fall back
to something sensible?  Won't this just feed the trolls who insist that shell
isn't ready?"

Of course it's not ready. We haven't even released F15 Alpha yet. Working on the fallback mechanism is one of the main points of effort for the desktop team ATM.

Theoretically you should file another ticket for the fallback issue, but actually the GNOME developers are already aware of it. If there's no actual report yet, though, it's probably worth filing one to make sure it can be tracked.



Comment 29 Roy 2011-03-29 12:05:10 EDT
As discussed earlier on #nouveau, the bug appears to be solved for me by using a 2.6.38 kernel and an out-of-tree build of nouveau. I have not had any lockups in almost 5 days, with compiz running. The "fixing" (knock on wood) patch is this one: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=25c68aef4e6abcc3c10f593fc565c342ebe2ded8 .
Chances are that the fix is a combination of this patch and earlier ones. What are the odds of finding a kernel in Fedora 15 with Nouveau worked up to this patch?
Comment 30 Adam Williamson 2011-03-29 12:38:36 EDT
f15 should usually have pretty much bang up-to-date nouveau. have you tried an f15 live image to see how it runs?
Comment 31 Roy 2011-03-29 13:33:40 EDT
Well, no, I have not tried this live image. The patch I linked to was written and pushed to the nouveau code base yesterday, so it's not likely to be included in a Fedora 15 live image as of today. If it weren't for the GCC 4.6 rebuilds I would grab a fresh Fedora 15 kernel for testing purposes, but I don't dare use that in a GCC 4.5 environment.
The kernel I am currently running, and have been running for the past month, is http://koji.fedoraproject.org/koji/buildinfo?buildID=217874 , with, as stated, an out-of-tree nouveau build from freedesktop.org. Using the module provided with this kernel, or the module I compiled myself before this patch was out (a few weeks ago), the system would lock up randomly every 0.5 to 4 hours, mostly depending on the usage. With this patch, I have not seen it lock up in 5 days.
Comment 32 Adam Williamson 2011-03-29 13:55:08 EDT
ah, I didn't check the date. I expect it'll get downstreamed into Fedora soon, then. Ben?
Comment 33 Roy 2011-03-29 14:17:47 EDT
Damn, damn, damn. It locked up.

[26807.636214] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00001068 0x00000000
[26809.638360] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00001068 0x00000000

And according to rnn, these regs have not been reverse-engineered yet. Well, at least there's an error message now, so there is a lead. And besides, it's way more stable, so I think it's a win nonetheless.
Comment 34 Nathaniel McCallum 2011-03-29 15:04:22 EDT
Roy, a 10x improvement is nothing to snicker at! :)

The patch does apply cleanly (with a small offset) to the latest F15 kernel and seems to be relatively self contained.  I've kicked off a scratch build of the latest F15 kernel + nouveau patch for F14 here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2957862

Please test it. :) Hopefully this will be essentially a drop in replacement and we can backport it to existing kernels.
Comment 35 Ben Skeggs 2011-03-29 17:56:00 EDT
I doubt just dropping that patch into F14 *will* improve things at all.  I did try this on an nva3 I had for a while, and it improved nothing.  It's likely the VM overhaul in later nouveau versions is *also* required to help with this bug.  That said, the patch will go into f15 anyway at some point.
Comment 36 Nathaniel McCallum 2011-03-30 22:17:10 EDT
For the record, I've been running this patch for at least 24 hours with no freezes.  It is a *massive* improvement (at least an order of magnitude).
Comment 37 Roy 2011-04-03 16:43:35 EDT
Apr  1 17:01:35 Torres kernel: [ 1124.477920] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x01b00003 0x00000000 0x00001068 0x00200000

Apr  3 22:25:37 Torres kernel: [33822.122010] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00005068 0x00000000

There are patterns in the error messages on lockup, even though they are not exactly identical.
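To make those patterns easier to see, the register values in each message can be compared bit by bit. The following is a small Python sketch using only the log lines quoted in this thread; `parse_timeout` and `differing_bits` are hypothetical helper names, and no meaning is assigned to the individual values, since the thread notes these registers had not been reverse-engineered.

```python
import re

# Captures the kernel uptime stamp and the hex values that follow the message.
LINE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\].*PGRAPH TLB flush idle timeout fail:\s*"
    r"(?P<regs>(?:0x[0-9a-fA-F]+\s*)+)"
)


def parse_timeout(line):
    """Return (uptime in seconds, list of register values), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    regs = [int(token, 16) for token in match.group("regs").split()]
    return float(match.group("uptime")), regs


def differing_bits(regs_a, regs_b):
    """XOR the values pairwise; set bits mark where two reports differ."""
    return [a ^ b for a, b in zip(regs_a, regs_b)]


# Messages copied from comments 33 and 37.
LINES = [
    "[26807.636214] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00001068 0x00000000",
    "Apr  1 17:01:35 Torres kernel: [ 1124.477920] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x01b00003 0x00000000 0x00001068 0x00200000",
    "Apr  3 22:25:37 Torres kernel: [33822.122010] [drm] nouveau 0000:01:00.0: PGRAPH TLB flush idle timeout fail: 0x00b00003 0x00000000 0x00005068 0x00000000",
]

if __name__ == "__main__":
    parsed = [parse_timeout(line) for line in LINES]
    _, baseline = parsed[0]
    for uptime, regs in parsed:
        diff = differing_bits(baseline, regs)
        print(f"{uptime / 3600:6.2f} h  " + " ".join(f"{value:#010x}" for value in diff))
```

Across the three messages quoted here, the second value is always zero, while the first, third and fourth each differ in a single bit between reports.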
Comment 38 Roy 2011-08-09 04:33:14 EDT
This issue appears to be fixed, at least for me, with Fedora 15
Comment 39 Ben Skeggs 2011-08-09 18:23:52 EDT
Yes, it probably should be okay in F15.  It's stupidly hard to say.  We don't really know what fixed it; it was likely a combination of a lot of different changes in nouveau.

Anyway, this won't be getting fixed in F14.  Closing :)
Comment 40 Jacek Pietrewicz 2012-03-28 04:00:17 EDT
Sorry, but I use F16 and am still getting system freezes with a system message about nouveau :-(
It happens only in GNOME 3 (gnome-shell).
Working with LXDE is OK.
