Bug 684907

Summary: [NVa8] X freezes frequently
Product: [Fedora] Fedora
Reporter: Horst H. von Brand <vonbrand>
Component: xorg-x11-drv-nouveau
Assignee: Ben Skeggs <bskeggs>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 15
CC: aaron, airlied, ajax, bskeggs, xgl-maint
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-04-25 13:46:23 UTC
Attachments (description: flags):
/var/log/messages from last boot (no hang) (attachment 484281): none
Last /var/log/Xorg.0.log (attachment 484282): none
~/.xsession errors from current session (no hang up to here) (attachment 484283): none
Output from dmesg just after a hang (attachment 484463): none
Output from gdb (attachment 484464): none
/var/log/messages, last two boots (attachment 485781): none
/var/log/messages, last two boots (attachment 485886): none

Description Horst H. von Brand 2011-03-14 18:44:36 UTC
Created attachment 484281
/var/log/messages from last boot (no hang)

Description of problem:
X freezes around 2 or 3 times a day here. This has been happening for quite some time (some 2 months), but I was on rawhide (heading towards 16), and the machine misbehaved in an entertaining variety of ways. I reinstalled from scratch last weekend, and the problem persists.

What happens is that the screen freezes completely (no date/time update, no reaction to keyboard like ctrl-alt-DEL or ctrl-BS). I can move the mouse pointer, but once when the spinner was active it didn't even spin. The keyboard LEDs don't react either (i.e., CapsLock doesn't turn the LED on).

Curiously this has happened on my two Toshiba notebooks, this one here with Nouveau, the other one with some oldish intel GPU (haven't checked that one lately, sorry). I also have a Samsung netbook, that one isn't prone to freezing.

Version-Release number of selected component (if applicable):
xorg-x11-drv-nouveau-0.0.16-23.20110303git92db2bc.fc15.x86_64
xorg-x11-server-Xorg-1.10.0-3.fc15.x86_64

01:00.0 VGA compatible controller [0300]: nVidia Corporation Device [10de:0a75] (rev a2)

How reproducible:
Around 2 to 3 times a day

Steps to Reproduce:
1. (Happens at random; once even while the screensaver was running)

Comment 1 Horst H. von Brand 2011-03-14 18:45:50 UTC
Created attachment 484282
Last /var/log/Xorg.0.log

Comment 2 Horst H. von Brand 2011-03-14 18:47:04 UTC
Created attachment 484283
~/.xsession errors from current session (no hang up to here)

Comment 3 Ben Skeggs 2011-03-15 02:47:19 UTC
Can you paste your dmesg output from *after* a hang has happened.  If you don't have a way of accessing it after it's hung, /var/log/messages from right after you reboot will do in its place.

Comment 4 Horst H. von Brand 2011-03-15 13:06:39 UTC
Created attachment 484463
Output from dmesg just after a hang

Comment 5 Horst H. von Brand 2011-03-15 13:10:12 UTC
Created attachment 484464
Output from gdb

Via SSH I ran "gdb /usr/bin/Xorg 1320" as root, and did an "info stack".

The output from "ps -p 1320 -l" is:
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0  1320  1317  1  80   0 - 43891 nouvea tty1     00:01:01 Xorg

The process Xorg still used CPU, albeit slowly.

"kill 1320" and such had no effect. Tried to "telinit 3", nothing. A "reboot" finally rebooted the machine, but it took some time to react.
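For anyone repeating this capture, the steps above could be scripted roughly as follows. This is only a sketch, not what was actually run here: the helper name capture_hang_state and the default output directory /tmp/xhang are made up for illustration, and it assumes a root shell over SSH with gdb installed.

```sh
# Sketch: collect diagnostics for a hung X server from an SSH session.
# capture_hang_state and /tmp/xhang are hypothetical names, not from this report.
capture_hang_state() {
    pid=$1                    # PID of the hung Xorg process
    out=${2:-/tmp/xhang}      # output directory (assumed location)
    mkdir -p "$out" || return 1
    # Kernel ring buffer, as requested in comment 3.
    dmesg > "$out/dmesg.txt" 2>&1 || true
    # Process state, including the WCHAN column shown above.
    ps -p "$pid" -l > "$out/ps.txt" 2>&1 || true
    # Stack of the running process; -batch makes gdb exit once the command is done.
    gdb -p "$pid" -batch -ex 'info stack' > "$out/gdb.txt" 2>&1 || true
    # Print where the files ended up.
    echo "$out"
}
# Usage (as root): capture_hang_state "$(pidof Xorg)"
```

Each command is allowed to fail without aborting the others, since on a half-hung machine some of them may not respond.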

Comment 6 Horst H. von Brand 2011-03-15 13:23:57 UTC
Connected via SSH (dead keyboard), as root in the above. Tried to "kill 1320" and such, nothing; tried "telinit 3", nothing; "reboot" finally worked (but it took some 30 seconds to kick in).

BTW, I had just restarted the machine and was editing a comment just like this one when it froze again...

Comment 7 Horst H. von Brand 2011-03-15 19:39:34 UTC
Just froze again...

Comment 8 Horst H. von Brand 2011-03-15 20:40:34 UTC
And again. Not funny anymore...

Comment 9 Ben Skeggs 2011-03-15 22:44:46 UTC
Can you add "nouveau.noaccel=1" to your boot options.  It's likely that this will become the default for F15 if this problem still persists.  There was some hope it was solved.  I have a NVA8 laptop now, and see *no* problem.
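After rebooting with the extra option, one way to confirm that it actually reached the kernel is to look at /proc/cmdline. A small sketch follows; has_boot_param is a hypothetical helper, and the grubby invocation in the comment assumes grubby is what manages the boot entries on this install (editing the kernel line in the boot loader configuration by hand works as well).

```sh
# Sketch: check whether a parameter is present on the kernel command line.
# has_boot_param is a hypothetical helper name.
# Assumed way to add the option persistently (as root):
#   grubby --update-kernel=ALL --args="nouveau.noaccel=1"
has_boot_param() {
    # $1: kernel command line, $2: parameter to look for (whole word match)
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
# Usage:
#   has_boot_param "$(cat /proc/cmdline)" nouveau.noaccel=1 && echo "noaccel is set"
```

The padding with spaces keeps a parameter such as nouveau.noaccel=10 from being mistaken for nouveau.noaccel=1.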

Comment 10 Horst H. von Brand 2011-03-16 01:21:19 UTC
Froze twice in the meantime... now as suggested in comment 9. Just Gnome fallback mode now :-(

Comment 11 Ben Skeggs 2011-03-16 03:07:14 UTC
Yes, the problem is *very* frustrating.  We have zero real idea about what is wrong.  I had hoped that this laptop would actually have the problem, so I could use a trial-and-error approach to fixing it, but for some odd reason it does not... Very, very frustrating.

Comment 12 Horst H. von Brand 2011-03-16 14:37:02 UTC
The smolt profile for this machine is pub_258fc546-3757-4b93-8019-c3ff4fa31e90 (Toshiba Satellite A505 PSAT9U-009LM1).

Tell me if I can be of some help.

BTW, no freezes since the change suggested in comment 9 (but I moved to XFCE, Gnome fallback isn't up to snuff).

Comment 13 Horst H. von Brand 2011-03-16 16:07:40 UTC
Created attachment 485781
/var/log/messages, last two boots

Even with the change suggested in comment 9 I got a hang. Sorry, I rebooted before remembering to do the capture dance with dmesg et al. Attached the full /var/log/messages for the hung session (XFCE4, no screensaver; I was off for lunch and had nothing running) and the messages for the current session.

Comment 14 Ben Skeggs 2011-03-16 22:16:14 UTC
Oh good (for me, anyway), it doesn't seem likely nouveau can be responsible for your hang in that case.  With "noaccel", after the mode is set, nouveau doesn't really do anything.

Your /var/log/messages is filled with a *lot* of messages from something else, not sure what it is, but it's looking like the next candidate to me.  Can you "modprobe -r intel_ips" after you've booted?
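To check whether intel_ips is loaded before removing it, something along these lines should do. It is a sketch only: module_loaded is a hypothetical helper that reads the list in /proc/modules (or any file in the same format), and the removal itself needs root.

```sh
# Sketch: test whether a kernel module is currently loaded.
# module_loaded is a hypothetical helper name.
module_loaded() {
    # $1: module name, $2: module list to read (defaults to /proc/modules)
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^$1 " "${2:-/proc/modules}"
}
# Usage (as root):
#   if module_loaded intel_ips; then modprobe -r intel_ips; fi
```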

Comment 15 Horst H. von Brand 2011-03-17 00:44:33 UTC
(In reply to comment #14)
> Oh good (for me, anyway), it doesn't seem likely nouveau can be responsible for
> your hang in that case.  With "noaccel", after the mode is set, nouveau doesn't
> really do anything.

It is just a _lot_ less frequent (1/day vs 5/day), so I wouldn't rule it out just yet. 

> Your /var/log/messages is filled with a *lot* of messages from something else,
> not sure what it is, but it's looking like the next candidate to me.  Can you
> "modprobe -r intel_ips" after you've booted?

Just checked, it is loaded. Will remove that one and go with accelerated nouveau...

Comment 16 Horst H. von Brand 2011-03-17 01:00:53 UTC
Rebooted, now in XFCE. Curiously, both times I ran non-accelerated with intel_ips I got a messed up taskbar (see bug # 688254).

Comment 17 Horst H. von Brand 2011-03-17 01:49:01 UTC
Created attachment 485886
/var/log/messages, last two boots

One run without intel_ips and with accelerated nouveau (hung like clockwork ;-) and now with intel_ips and acceleration.

(Sorry, haven't anything handy for ssh connection to this machine right now).

Will remove intel_ips again and see what happens.

Comment 18 Horst H. von Brand 2011-03-17 03:23:01 UTC
No freeze til now. Need to shut down for today.

Comment 19 Horst H. von Brand 2011-03-17 18:11:56 UTC
Started this machine up around 9 AM, now it is 3 PM. No freezes, Nouveau accelerated (no kernel parameter), intel_ips rmmod'ed soon after boot. So it looks like the culprit is really intel_ips (normally it would have frozen 2 or 3 times by now).

Comment 20 Ben Skeggs 2011-03-17 22:29:12 UTC
Interesting, to clarify the situation before I reassign this to the kernel:

nouveau.noaccel=0 + intel_ips = hang
nouveau.noaccel=1 + intel_ips = hang
nouveau.noaccel=0 - intel_ips = no hang
nouveau.noaccel=1 - intel_ips = no hang

Comment 21 Horst H. von Brand 2011-03-18 12:46:22 UTC
It is a bit more complicated than that...

Some notation: NA == Nouveau, accelerated (+/-); II == intel_ips enabled (+/-)

Then:

 NA+ II+   Hangs rather reliably (within a half hour or so right now)
 NA- II+   Hangs, but less frequently
 NA+ II-   No hangs (*)
 NA- II-   Not tested

(*) It did hang twice for me, but the second time (it happened today) I had started my XFCE session before remembering to disable II; I killed the (starting) session with ctrl-BS, and on tty2 I rmmod'ed intel_ips. Shortly after logging in again it hung. It seems that the II- has to be before there was any serious use of X, i.e. very soon after boot.

I did run the full day yesterday (9 AM to around 6 PM) without any hang with NA+ II- (Using XFCE, presumably not so 3D-demanding? But then again, Gnome vs XFCE didn't make much of a difference before...)
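To keep track of which of these combinations a given session is actually running, the label could be derived from the live system instead of from memory. The following is a rough sketch using the NA/II notation above; config_label is a hypothetical helper, and it only looks at nouveau.noaccel=1 on the kernel command line and at whether intel_ips shows up in the module list (it knows nothing about nomodeset or about when the module was removed).

```sh
# Sketch: print the NA/II label for the current configuration.
# config_label is a hypothetical helper name.
config_label() {
    # $1: kernel command line, $2: module list (defaults to /proc/modules)
    na="NA+"
    case " $1 " in
        *" nouveau.noaccel=1 "*) na="NA-" ;;   # acceleration switched off
    esac
    ii="II-"
    if grep -q '^intel_ips ' "${2:-/proc/modules}"; then
        ii="II+"                               # intel_ips is loaded
    fi
    printf '%s %s\n' "$na" "$ii"
}
# Usage:
#   config_label "$(cat /proc/cmdline)"
```

Noting the label together with the uptime at each hang would make later reports easier to compare.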

Comment 22 Horst H. von Brand 2011-03-18 18:57:09 UTC
Count another day without incident (from around 9 AM to 4 PM or so), NA+, II- (but II- set right after a cold boot).

Comment 23 Aaron Sowry 2011-03-19 13:43:07 UTC
For me, NA+ II- still results in lock-ups, even with the intel_ips module blacklisted.
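For reference, blacklisting a module usually means dropping a one-line file into /etc/modprobe.d. A minimal sketch, assuming that directory is in use on this install; blacklist_module and the file name it generates are hypothetical, not taken from this report.

```sh
# Sketch: write a modprobe blacklist entry for a module.
# blacklist_module and the "<module>-blacklist.conf" file name are hypothetical.
blacklist_module() {
    # $1: module name, $2: configuration directory (defaults to /etc/modprobe.d)
    printf 'blacklist %s\n' "$1" > "${2:-/etc/modprobe.d}/${1}-blacklist.conf"
}
# Usage (as root):
#   blacklist_module intel_ips
```

Note that a blacklist entry only stops the module from being loaded automatically through its aliases; an explicit "modprobe intel_ips" still loads it.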

Comment 24 Horst H. von Brand 2011-03-19 17:12:15 UTC
Got another hang with NA+ II-, setup just like comment 22.

Comment 25 Horst H. von Brand 2011-03-19 23:39:07 UTC
(In reply to comment #24)
> Got another hang with NA+ II-, setup just like comment 22.

I updated rsyslog at that time, which broke the system (see bug 689121), that might have been the cause for the hang (I tried to login via SSH, but got no response and had to reboot the hard way).

Comment 26 Horst H. von Brand 2011-03-20 21:34:26 UTC
NA+, II-. Uptime is 22 hours.

Comment 27 Aaron Sowry 2011-03-22 15:49:44 UTC
NA+, II-. Two crashes in <24h.

Comment 28 Horst H. von Brand 2011-03-22 18:31:35 UTC
Now kernel-2.6.38-1.fc15.x86_64, xorg-x11-drv-nouveau-0.0.16-23.20110303git92db2bc.fc15.x86_64. By accident I left out the "disable intel_ips" bit a few times, and it has hung much less (some 5 hours between hangs vs less than half an hour).

Comment 29 Ben Skeggs 2011-03-22 22:48:16 UTC
(In reply to comment #28)
> Now kernel-2.6.38-1.fc15.x86_64,
> xorg-x11-drv-nouveau-0.0.16-23.20110303git92db2bc.fc15.x86_64. By accident I
> left out the "disable intel_ips" bit a few times, and it has hung much less
> (some 5 hours between hangs vs less than half an hour).

Still hanging then, even with noaccel?  Okay.. I really doubt what you're seeing is nouveau's fault now.  It's still possible though I guess.

Any chance you can brave using the vesa driver for a day or so?  Just add "nomodeset" to your boot options and X should automatically fall back to it.

Comment 30 Horst H. von Brand 2011-03-23 17:51:48 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > Now kernel-2.6.38-1.fc15.x86_64,
> > xorg-x11-drv-nouveau-0.0.16-23.20110303git92db2bc.fc15.x86_64. By accident I
> > left out the "disable intel_ips" bit a few times, and it has hung much less
> > (some 5 hours between hangs vs less than half an hour).
> 
> Still hanging then, even with noaccel?  Okay.. I really doubt what you're
> seeing is nouveau's fault now.  It's still possible though I guess.

No, this is without noaccel and with intel_ips, but the newer kernel is _much_ less prone to X hangs (current uptime is almost 5 hours, no hang; used to freeze at most after an hour or so).

Comment 31 Horst H. von Brand 2011-03-27 03:31:52 UTC
Had a couple of hangs with NA+ II- (kernel-2.6.38-1.fc15.x86_64 and kernel-2.6.38.1-6.fc15.x86_64, xorg-x11-drv-nouveau-0.0.16-24.20110324git8378443.fc15.x86_64)

Comment 32 Ben Skeggs 2011-03-27 22:19:53 UTC
(In reply to comment #31)
> Had a couple of hangs with NA+ II- (kernel-2.6.38-1.fc15.x86_64 and
> kernel-2.6.38.1-6.fc15.x86_64,
> xorg-x11-drv-nouveau-0.0.16-24.20110324git8378443.fc15.x86_64)

If you're still getting hangs with noaccel (I presume that is NA+?), I think it's probably almost time to reassign this elsewhere.  But, where, I'm not exactly sure.  Probably the kernel.  One last ditch effort to rule out nouveau completely, are you able to brave the vesa driver for a bit?

Comment 33 Horst H. von Brand 2011-03-27 22:33:22 UTC
OK, yet again: noaccel == NA- (Nouveau, acceleration off).
               intel_ips == II+ (Intel ips on)

And as I said: This was _without_ noaccel, and with intel_ips rmmod'ed.

Currently my uptime is almost 20 hours, NA+, II- (no noaccel, intel_ips rmmod'ed). Yesterday it hung after 2 hours with the same (old kernel), then with the new kernel it hung after some 1/2 hour.

By the timing, I think this is a different bug than the original one.

Comment 34 Horst H. von Brand 2011-03-28 18:55:34 UTC
A new hang, NA+, II- after some 3 hours.

Currently with nomodeset, II+, uptime 2 1/2 hours.

Comment 35 Horst H. von Brand 2011-03-28 18:59:31 UTC
(In reply to comment #34)
> A new hang, NA+, II- after some 3 hours.
> 
> Currently with nomodeset, II+, uptime 2 1/2 hours.

BTW, I now see strange compositing errors (once XEmacs window was all vertical stripes until I refreshed it; now the XFCE taskbar has the battery icon replicated 4 times and no Bluetooth nor NetworkManager).

Comment 36 Horst H. von Brand 2011-03-30 02:47:42 UTC
Ran most of today with VESA (nomodeset) + intel_ips, no hangs; right now nouveau with acceleration + intel_ips is up for some 4 hours.

kernel-2.6.38.2-8.fc15.x86_64
xorg-x11-drv-nouveau-0.0.16-24.20110324git8378443.fc15.x86_64

Comment 37 Ben Skeggs 2011-03-30 04:30:54 UTC
That's thoroughly confusing.. We don't really *do* anything in the noaccel case.

Can you use noaccel again, and get the X log from that, and a gdb backtrace of where X is stuck when it hangs?

Comment 38 Horst H. von Brand 2011-03-30 22:59:53 UTC
Yet again: The one that made (!) the most difference was intel_ips, it did hang noticeably less without it, noaccel didn't make much of a difference lately. Right now, I've had this machine running all night (but the screensaver did kick in) and all day (mostly away, so also screensaver) with _no_ special configuration at all, no more hangs. I saw the frequency of hanging diminish with kernel version, right now it is kernel-2.6.38.2-8.fc15.x86_64.

VESA looks butt-ugly, and has compositing problems.

Comment 39 Horst H. von Brand 2011-03-31 16:06:18 UTC
(In reply to comment #38)
> Yet again: The one that made (!) the most difference was intel_ips, it did hang
> noticeably less without it, noaccel didn't make much of a difference lately.
> Right now, I've had this machine running all night (but the screensaver did
> kick in) and all day (mostly away, so also screensaver) with _no_ special
> configuration at all, no more hangs. I saw the frequency of hanging diminish
> with kernel version, right now it is kernel-2.6.38.2-8.fc15.x86_64.
> 
> VESA looks butt-ugly, and has compositing problems.

Argh... scratch that, I had configured the kernel with nomodeset :-(

Comment 40 Horst H. von Brand 2011-03-31 23:14:39 UTC
OK, now really without any special configuration (except for selinux=0 due to /run breakage), running a few hours without any trouble.

Comment 41 Horst H. von Brand 2011-04-01 23:52:45 UTC
Ran most of today without trouble (no special configuration), did hang on shutdown (it looked like it, everything froze and I had to turn the machine off as I had no time to wait).

Recently it seems to have frozen for a few seconds.

If it still happens here, it is certainly after some 5 hours running at the very least.

Comment 42 Horst H. von Brand 2011-04-04 16:43:37 UTC
Updating with latest data: kernel-2.6.38.2-9.fc15.x86_64, xorg-x11-drv-nouveau-0.0.16-24.20110324git8378443.fc15.x86_64. Running a few days with nomodeset (== VESA) gives no hang, running some 36 hours with no special configuration gave 2 (or perhaps 3) hangs. Currently running a few hours without intel_ips.

Comment 43 Horst H. von Brand 2011-04-05 12:27:59 UTC
Had a hang yesterday in the afternoon with no special configuration, booted and rmmod'ed intel_ips and had no hang after that. Today I logged into XFCE and after that I rmmod'ed intel_ips, and had a hang a half hour later. Now trying the same again (remembered intel_ips too late ;-)

Comment 44 Horst H. von Brand 2011-04-07 19:47:50 UTC
Yet again: No special configuration, froze after an hour or so; rmmod intel_ips _before_ entering XFCE and running for some 6 hours now. Doing the rmmod _after_ the desktop starts seems to be much less effective.

Comment 45 Horst H. von Brand 2011-04-13 00:23:36 UTC
This is driving me nuts... No special configuration, two days straight (some 20 hours) no hangs, then a hang. Reboot, a new hang 20 minutes later, and again after 10 minutes. Today a hang after running an hour or so, then a hang 15 minutes later. Then hours without a problem, and a new hang.

BTW, the hangs with the short period between them were when I was running something under qemu-kvm (i686 program under x86_64).

Comment 46 Horst H. von Brand 2011-04-14 17:18:43 UTC
Now I rmmod'ed intel_ips before logging in, had a hang (after suspending + waking up, and some 30 minutes of use). Again, a hang with the same configuration today after working some 2 hours. Currently running with nomodeset, no intel_ips (intel_ips now blacklisted, but I doubt it will do much good).

Problem is that the very same configuration can work 2 days straight (some 20 hours) and then hang thrice in an hour. Very frustrating.

(And Murphy's law makes sure it happens at the worst possible moments too, at least that part of the universe is working fine).

Comment 47 Horst H. von Brand 2011-04-15 12:45:29 UTC
Tried with nouveau.noaccel=1, but that doesn't work now: at most 1024x768 (my notebook monitor does 1366x768, the external VGA monitor does 1680x1050; both look awful as you can imagine); can't configure an external monitor at all. It seems nomodeset is better here :-(

Comment 48 Horst H. von Brand 2011-04-17 15:34:53 UTC
OK, new data points: After removing all configuration, it hung after some 20 minutes of use. And then the rest of Friday without trouble, currently (Sunday) uptime is 25 hours (not all in active use, though ;-), no troubles.

Looks like I'll just ride it out. I guess there was some intel_ips problem (now fixed) with the same symptoms (or which triggered the Nouveau bug somehow).

Comment 49 Aaron Sowry 2011-04-18 07:29:46 UTC
Horst,

Can you please try upgrading to the latest kernel from koji?

http://koji.fedoraproject.org/koji/buildinfo?buildID=238906

Comment 50 Horst H. von Brand 2011-04-19 20:41:00 UTC
Will do. Note that the frequency of freezing is around once a day or so right now, it'll take a few days to confirm this is really fixed.

Comment 51 Horst H. von Brand 2011-04-20 10:49:39 UTC
Have been running a few hours without freezes now.

Comment 52 Aaron Sowry 2011-04-20 12:01:32 UTC
Ben, this should probably be closed as a dupe of bug #684608

Comment 53 Ben Skeggs 2011-04-20 12:13:21 UTC
(In reply to comment #52)
> Ben, this should probably be closed as a dupe of bug #684608

Perhaps.  I left it open as Horst also mentioned hangs with NoAccel.. Which, well, should not happen.. I'm not entirely sure how nouveau *could* cause them with NoAccel turned on...

Comment 54 Horst H. von Brand 2011-04-21 13:53:17 UTC
(In reply to comment #53)
> (In reply to comment #52)
> > Ben, this should probably be closed as a dupe of bug #684608
> 
> Perhaps.  I left it open as Horst also mentioned hangs with NoAccel.. Which,
> well, should not happen.. I'm not entirely sure how nouveau *could* cause them
> with NoAccel turned on...

I did see infrequent hangs with noaccel (I didn't use it much), but that was way back (when hangs without configuration happened every 10 to 20 minutes). Also note that my older notebook (also Toshiba, but with an intel GPU; same Fedora) was prone to freezing the same way, but less. I haven't used it much lately, but it didn't freeze at all recently.

The situation has changed substantially (due to kernel evolution?), so perhaps there were several different bugs at work. And yes, the symptoms are as in bug #684608.

Currently I'm running the kernel suggested in comment 49 for more than a day without incidents. I'll let this weekend pass and call it fixed unless something shows up.

Comment 55 Horst H. von Brand 2011-04-23 19:02:05 UTC
Updated the kernel today. Had kernel-2.6.38.3-15.rc1.fc15.x86_64 running for some 20 hours straight without freezes, now kernel-2.6.38.3-18.fc15.x86_64 for a few hours. I guess this can be closed...

Comment 56 Horst H. von Brand 2011-04-24 21:05:49 UTC
Almost a day with kernel-2.6.38.3-18.fc15.x86_64, no problems up to now.

Comment 57 Horst H. von Brand 2011-04-25 13:25:48 UTC
No further freezes. Any other experiment worth doing, or just close this sucker?

[Thanks everybody!]

Comment 58 Ben Skeggs 2011-04-25 13:46:23 UTC
We can close this then.  We do have 684608 covering this bug already (and, actually, another one which I originally intended to cover this issue); I didn't duplicate it on suspicion that you were also seeing another bug.

But, let's do it!