Bug 523905

Summary: Random X server segfaults
Product: [Fedora] Fedora
Component: xorg-x11-drv-nouveau
Version: rawhide
Hardware: i686
OS: Linux
Status: CLOSED RAWHIDE
Severity: high
Priority: low
Reporter: Richard Colley <richard.colley>
Assignee: Ben Skeggs <bskeggs>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: airlied, ajax, awilliam, bskeggs, xgl-maint
Doc Type: Bug Fix
Last Closed: 2009-11-05 00:44:29 UTC
Attachments:
Description (Flags)
X server log file (none)
/var/log/messages (none)
startx output (none)
xorg.conf (none)
stack trace with symbolic info (none)
Another stack trace with symbols (none)
X stack trace from 23 Sep 09, 1:31pm+10 (none)
gdb stack trace of X server after crash (none)
X server stack trace (none)
Text output to tty7 after X stack trace in attachment #367388 (none)
X server stack trace (none)
startx output associated with X stack trace in attachment #367390 (none)
X server stack trace (none)
startx output associated with X stack trace in attachment #367392 (none)
X server stack trace - vesa driver (none)
startx output associated with X stack trace in attachment #367402 (none)
X server stack trace - no xorg.conf (none)
startx output associated with X stack trace in attachment #367404 (none)
xorg.conf file associated with stack trace in attachment #367402 (none)

Description Richard Colley 2009-09-17 06:35:48 UTC
Created attachment 361429
X server log file

Description of problem:

The X server crashes intermittently, sometimes frequently, and seemingly more often while audio is playing.  Stack traces are in the attached files.

I can sometimes go a day without a crash, but typically have 3-10 crashes per day.

The symptoms at the time of crash are:

* the screen completely freezes
* mouse freezes
* keyboard freezes (capslock etc don't respond, nor does vt switching)
* about 9 times out of 10 the OS is still running: if I hit the power button, it shuts down gracefully.  I haven't tried ssh'ing in.  On rare occasions the power button isn't recognised either, and I am forced to reset the machine.



Version-Release number of selected component (if applicable):

 Many versions (over the last few months of fc12).

 Current versions are:

  kernel-2.6.31-14.fc12.i686
  xorg-x11-server-Xorg-1.6.99.901-2.fc12.i686
  xorg-x11-server-common-1.6.99.901-2.fc12.i686
  xorg-x11-drv-nouveau-0.0.15-10.20090914git1b72020.fc12.i686
  libXrandr-1.3.0-3.fc12.i686
  kdebase-4.3.1-2.fc12.i686
  kdelibs-4.3.1-3.fc12.i686
  qt-4.5.2-18.fc12.i686

 Current video card is an NV96.  Running multi-headed.

How reproducible:

It happens randomly, but I can reproduce it fairly readily by playing music in Amarok while opening or closing apps.

Steps to Reproduce:

n/a
  
Actual results:

X locks up, and sometimes the whole system freezes.


Expected results:

X works without lockups.


Additional info:

Please see attached log files.

Comment 1 Richard Colley 2009-09-17 06:36:15 UTC
Created attachment 361430
/var/log/messages

Comment 2 Richard Colley 2009-09-17 06:36:38 UTC
Created attachment 361432
startx output

Comment 3 Richard Colley 2009-09-17 06:53:53 UTC
Just crashed again.  This time the stack trace is a little different:

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x3c) [0x80a3c8c]
1: /usr/bin/X (0x8048000+0x5f4b6) [0x80a74b6]
2: (vdso) (__kernel_rt_sigreturn+0x0) [0x23240c]
3: /usr/bin/X (dixLookupPrivate+0x24) [0x8088f14]
4: /usr/bin/X (FreePicture+0x7f) [0x810f14f]
5: /usr/bin/X (FreeResource+0x112) [0x808c592]
6: /usr/bin/X (0x8048000+0xcef93) [0x8116f93]
7: /usr/bin/X (0x8048000+0xc9b44) [0x8111b44]
8: /usr/bin/X (0x8048000+0x26137) [0x806e137]
9: /usr/bin/X (0x8048000+0x1a885) [0x8062885]
10: /lib/libc.so.6 (__libc_start_main+0xe6) [0x3d0b36]
11: /usr/bin/X (0x8048000+0x1a471) [0x8062471]
Segmentation fault at address 0x150

Fatal server error:
Caught signal 11 (Segmentation fault). Server aborting
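
Because several different traces are posted in this report, it can help to reduce each one to its named frames and compare them. The following is only a rough sketch, assuming the log format shown above; the function names (`parse_frames`, `named_symbols`) are made up for illustration and are not part of any X.Org tool. Frames printed as `0x8048000+offset` have no symbol and are reported with `symbol` set to `None`.

```python
import re

# Matches the frame format printed in the server log above, e.g.
#   3: /usr/bin/X (dixLookupPrivate+0x24) [0x8088f14]
FRAME_RE = re.compile(
    r"^(?P<index>\d+): (?P<module>\S+) "
    r"\((?P<base>[^+\s]+)\+0x(?P<offset>[0-9a-fA-F]+)\) "
    r"\[0x(?P<address>[0-9a-fA-F]+)\]$"
)


def parse_frames(text):
    """Return one dict per backtrace frame found in `text`."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a frame line
        base = match.group("base")
        # Unresolved frames show the module load address instead of a symbol.
        resolved = not base.startswith("0x")
        frames.append({
            "index": int(match.group("index")),
            "module": match.group("module"),
            "symbol": base if resolved else None,
            "offset": int(match.group("offset"), 16),
            "address": int(match.group("address"), 16),
        })
    return frames


def named_symbols(frames):
    """Return the symbol names of the resolved frames, innermost first."""
    return [frame["symbol"] for frame in frames if frame["symbol"]]


# First seven frames of the trace in this comment.
SAMPLE = """\
0: /usr/bin/X (xorg_backtrace+0x3c) [0x80a3c8c]
1: /usr/bin/X (0x8048000+0x5f4b6) [0x80a74b6]
2: (vdso) (__kernel_rt_sigreturn+0x0) [0x23240c]
3: /usr/bin/X (dixLookupPrivate+0x24) [0x8088f14]
4: /usr/bin/X (FreePicture+0x7f) [0x810f14f]
5: /usr/bin/X (FreeResource+0x112) [0x808c592]
6: /usr/bin/X (0x8048000+0xcef93) [0x8116f93]
"""

if __name__ == "__main__":
    print(named_symbols(parse_frames(SAMPLE)))
```

Running the same two functions over each attached trace and intersecting the results would show which functions recur across crashes.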

Comment 4 Richard Colley 2009-09-17 08:05:38 UTC
Created attachment 361448
xorg.conf

Comment 5 Richard Colley 2009-09-18 00:09:26 UTC
Created attachment 361579
stack trace with symbolic info

Comment 6 Richard Colley 2009-09-18 09:34:22 UTC
Created attachment 361618
Another stack trace with symbols

After this crash, the o/s was still working, and I could shut down gracefully.

Comment 7 Richard Colley 2009-09-18 09:35:24 UTC
Comment on attachment 361579
stack trace with symbolic info

After this crash (gdb_log.2214), the o/s had locked up hard, and I had to forcibly power off.

Comment 8 Richard Colley 2009-09-23 03:30:09 UTC
This is still happening.

 Current versions are:

  kernel-2.6.31-33.fc12.i686
  xorg-x11-server-Xorg-1.6.99.902-1.fc12.i686
  xorg-x11-server-common-1.6.99.902-1.fc12.i686
  xorg-x11-drv-nouveau-0.0.15-11.20090921gitdf94ebd.fc12.i686
  libXrandr-1.3.0-3.fc12.i686
  kdebase-4.3.1-2.fc12.i686
  kdelibs-4.3.1-6.fc12.i686
  qt-4.5.2-19.fc12.i686

New stack trace attached.

Comment 9 Richard Colley 2009-09-23 03:31:10 UTC
Created attachment 362162
X stack trace from 23 Sep 09, 1:31pm+10

Comment 10 Adam Williamson 2009-09-29 19:19:15 UTC
Ben?

-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 11 Ben Skeggs 2009-10-08 03:56:09 UTC
Mmm, no real idea.  There have been bugfixes to both the X side and nouveau in recent times, though it may be best to wait a bit as there are more fixes to come.  I'll update when they're available :)

Comment 12 Richard Colley 2009-10-08 04:07:46 UTC
I can tell you that it is still happening even with the latest released changes (yesterday's updates).

I'll be watching out for your comments :) Thanks!


xorg-x11-server-Xorg-1.7.0-1.fc12.i686
xorg-x11-drv-nouveau-0.0.15-13.20090929gitdd8339f.fc12.i686
etc.

Comment 13 Adam Williamson 2009-10-15 00:20:16 UTC
you could try this xorg-x11-drv-nouveau build:

http://koji.fedoraproject.org/koji/buildinfo?buildID=135644

and this kernel build:

http://koji.fedoraproject.org/koji/buildinfo?buildID=136674

and see how they behave.

Comment 14 Richard Colley 2009-10-15 00:31:10 UTC
Thanks Adam, I'll give that a try over the next day or two.

However, if you look at the stack traces, most of the crashes seem to occur while X is freeing resources, and do not appear to be related to the kernel or driver.  I suspect this is memory corruption in the user-mode part of the X server.

Richard

Comment 15 Richard Colley 2009-10-15 00:33:16 UTC
I should also add that, over time, I have come to recognise that this problem most often occurs when I close a window or app.

Comment 16 Adam Williamson 2009-10-15 00:52:02 UTC
there's definitely some system-specific element to this, because I'm running nouveau on current Rawhide and it stays up happily for days at a time.

Ben did say he wanted you to try the latest changes to kernel and nouveau, so that's why I pointed them out. they are not in Rawhide because of the beta freeze.

Comment 17 Richard Colley 2009-10-19 01:57:27 UTC
Created attachment 365187
gdb stack trace of X server after crash

I installed the requested kernel and nouveau drivers last Friday, but didn't really get around to using the system much until today.

Unfortunately, the same sort of crash is still occurring.

Please see attached gdb stack trace.

Comment 18 Adam Williamson 2009-10-19 21:39:14 UTC
Thanks. Ben?

Comment 19 Adam Williamson 2009-10-19 21:40:02 UTC
btw, note that although this is still assigned to nouveau, adam jackson and dave airlie (two of our X server guys) are CCed on it, so if they think it's in the server and have any bright ideas, they'll be jumping in =)

Comment 20 Richard Colley 2009-10-20 13:24:20 UTC
Is there any chance of a build without optimisations, in case the stack trace then reveals more?  Also, the next time it happens, is there anything you want me to do in gdb?  Anything you want printed out?

Comment 21 Adam Williamson 2009-10-20 21:00:16 UTC
ben, those questions are for you :)

Comment 22 Ben Skeggs 2009-10-20 22:31:19 UTC
There still appear to be numerous issues in some recent EXA changes.  I suspect that is the case here too, and am looking into it.

Comment 23 Richard Colley 2009-10-21 09:58:45 UTC
Thanks Ben.  This issue has been happening to me for months, not days or even weeks.  So I'm not sure it is something newly introduced.  Then again...???

I spent a couple of days running X under valgrind, but unfortunately couldn't trigger the bug.  And since it was so horribly slow, I've stopped using it that way.  So sorry I couldn't add anything useful there.

Comment 24 Richard Colley 2009-10-22 05:42:09 UTC
Not sure if this is a related bug, but when changing virtual terminals (from text console to X), I got this segfault...

Program received signal SIGSEGV, Segmentation fault.
0x08089034 in privateExists (key=0x64d898, privates=0x18) at privates.c:79
79          return *key && *privates &&
Current language:  auto
The current source language is "auto; currently c".
(gdb) bt
#0  0x08089034 in privateExists (key=0x64d898, privates=0x18) at privates.c:79
#1  dixLookupPrivate (key=0x64d898, privates=0x18) at privates.c:162
#2  0x00642298 in exaPolyFillRect (pDrawable=<value optimized out>, pGC=<value optimized out>, nrect=<value optimized out>, prect=<value optimized out>)
    at exa_accel.c:764
#3  0x0811c1c6 in damagePolyFillRect (pDrawable=<value optimized out>, pGC=<value optimized out>, nRects=<value optimized out>, pRects=<value optimized out>)
    at damage.c:1404
#4  0x0809be02 in miPaintWindow (pWin=<value optimized out>, prgn=<value optimized out>, what=<value optimized out>) at miexpose.c:670
#5  0x0809c198 in miWindowExposures (pWin=<value optimized out>, prgn=<value optimized out>, other_exposed=<value optimized out>) at miexpose.c:504
#6  0x0817b5ec in xf86XVWindowExposures (pWin=<value optimized out>, reg1=<value optimized out>, reg2=<value optimized out>) at xf86xv.c:1054
#7  0x081ac5e8 in miHandleValidateExposures (pWin=<value optimized out>) at miwindow.c:246
#8  0x08097a54 in MapWindow (pWin=<value optimized out>, client=<value optimized out>) at window.c:2658
#9  0x0806d829 in ProcMapWindow (client=<value optimized out>) at dispatch.c:843
#10 0x0806e187 in Dispatch () at dispatch.c:445
#11 0x08062875 in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:285
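
One possible reading of `privates=0x18` in frame #0 (and of the fault address 0x150 in comment 3) is that the code took the address of a field inside a structure whose pointer was NULL, so the "address" is just the field's offset. This is a guess from the numbers, not something the trace confirms. The sketch below only illustrates the arithmetic: `HypotheticalObject` and its layout are invented for the example and are not the X server's real structures.

```python
import ctypes


class HypotheticalObject(ctypes.Structure):
    """Invented 32-bit layout: 24 bytes of header, then a pointer-sized field."""
    _fields_ = [
        ("header", ctypes.c_uint32 * 6),   # 6 * 4 bytes = 0x18 bytes
        ("privates", ctypes.c_uint32),     # stands in for a 32-bit pointer
    ]


def field_address(base, struct_type, field_name):
    """Address that &ptr->field would evaluate to when ptr == base."""
    return base + getattr(struct_type, field_name).offset


def looks_like_null_field_access(address, page_size=4096):
    """Heuristic: an address inside the first page is usually NULL + offset."""
    return 0 <= address < page_size


if __name__ == "__main__":
    # With a NULL base pointer the field address is just the field offset.
    print(hex(field_address(0, HypotheticalObject, "privates")))
    # The two suspicious values from this report pass the heuristic;
    # the key pointer from the same frame (0x64d898) does not.
    for value in (0x18, 0x150, 0x64d898):
        print(hex(value), looks_like_null_field_access(value))
```

If that reading is right, the interesting question in gdb is which object pointer was NULL in the calling frame, rather than what `privates` itself points to.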

Comment 25 Richard Colley 2009-11-04 02:44:20 UTC
Still happening, as of the latest (2009-11-04) published versions of everything, including Xorg 1.7.0-5.fc12.

I have had a couple of variants of the crash, and will attach stack dumps after this.

But one interesting thing to note is that I can now more or less get the problem to occur at will.

Using Sun's JRE 1.6.0_16, in particular Java Web Start apps (for example, http://www.playclockwiser.com/clockwiser.jnlp), and running from the console like so:

/usr/java/jre1.6.0_16/bin/javaws ~/Download/clockwiser.jnlp

... is a pretty reliable way to cause X to crash ... particularly when the application is shut down. (Not every time, but very often).  NB: it is not just this app, but rather this is a convenient publicly available app that triggers the bug for me.

Stack traces etc to be attached next.

Comment 26 Richard Colley 2009-11-04 02:46:58 UTC
Created attachment 367388
X server stack trace

Comment 27 Richard Colley 2009-11-04 02:47:58 UTC
Created attachment 367389
Text output to tty7 after X stack trace in attachment #367388

Comment 28 Richard Colley 2009-11-04 02:51:46 UTC
Created attachment 367390
X server stack trace

This stack trace was taken when X died after shutting down a JWS app.

NB: there seems to be no driver specific involvement in this particular crash.

Comment 29 Richard Colley 2009-11-04 02:52:35 UTC
Created attachment 367391
startx output associated with X stack trace in attachment #367390

Comment 30 Richard Colley 2009-11-04 02:54:51 UTC
Created attachment 367392
X server stack trace

Yet another stack trace from an X crash when closing a JWS app.

NB: this is different but similar to the other crashes involving exa.

Comment 31 Richard Colley 2009-11-04 02:55:30 UTC
Created attachment 367393
startx output associated with X stack trace in attachment #367392

Comment 32 Richard Colley 2009-11-04 04:41:03 UTC
I have a couple more stack traces to add.

The reason these are notable is that:

a) all previous runs of X were with the xorg.conf attached to this bug.  The first new stack trace comes from a run after I deleted the xorg.conf and let X autodetect everything.  The desktop displayed correctly, but I could still induce a crash.

b) the other comes from a run with the vesa driver, not nouveau!  It goes some way to showing this problem is a server issue, and not (only) a driver one.

Comment 33 Richard Colley 2009-11-04 04:42:34 UTC
Created attachment 367402
X server stack trace - vesa driver

Comment 34 Richard Colley 2009-11-04 04:43:19 UTC
Created attachment 367403
startx output associated with X stack trace in attachment #367402

Comment 35 Richard Colley 2009-11-04 04:43:58 UTC
Created attachment 367404
X server stack trace - no xorg.conf

Comment 36 Richard Colley 2009-11-04 04:44:45 UTC
Created attachment 367405
startx output associated with X stack trace in attachment #367404

Comment 37 Richard Colley 2009-11-04 04:45:59 UTC
Created attachment 367406
xorg.conf file associated with stack trace in attachment #367402

Comment 38 Richard Colley 2009-11-04 23:01:13 UTC
I can't believe it!!! I have updated to the latest X server 1.7.1-6 and things are stable! At least so far. I have tried my previous repeatable ways to get the X server to crash, and it hasn't!!!!

After many months of being told it's my graphics card, finally an X server update fixes this. Perhaps it's been a mixture of X server and driver problems. Whatever, it feels good to get some confidence back in my desktop.

I really really really hope that you seriously consider releasing 1.7.1 as part of FC12 final. I'm sure you will regret it otherwise.

I hope these celebrations aren't temporary. But I'll keep running X and gdb for now, just in case.

Comment 39 Adam Williamson 2009-11-05 00:44:29 UTC
issues that are fixed in the server can still only manifest on some cards.

1.7.1-5 has already been tagged, that would be enough to fix your issue. -6's change is unrelated. let's close this one, then - re-open if the problems come back.

Comment 40 Richard Colley 2009-11-05 00:57:53 UTC
Then why did the problems occur with the vesa driver too?

Comment 41 Adam Williamson 2009-11-05 01:28:33 UTC
presumably because the bug was in the server code. we never said the bug was in the driver code, ben only suggested at first that that might be the case, as a possibility. it's not important, the important thing is it's fixed...
