Description of problem:

I am running CentOS 4 (built from RHEL4 sources) on a Dell workstation. Quite a few times now I've had a sudden system hang, where nothing appears to work except that moving the mouse still moves the mouse cursor. Ctrl-Alt-Fn does not switch back to a text console, and Ctrl-Alt-Backspace does not kill X.

I am able to access the system over the network, where I find that X is using nearly all of the available CPU.

bash-3.00$ ps -fww 4802
UID PID PPID C STIME TTY STAT TIME CMD
root 4802 4766 0 Sep14 ? R 271:30 /usr/X11R6/bin/X :0 -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp vt7
bash-3.00$

Running strace on the X process shows it to be in a SIGALRM loop with 0 timeout:

...
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn() = ? (mask now [])
...

Version-Release number of selected component (if applicable):
xorg-x11-6.8.2-1.EL.13.37.2

How reproducible:
Happens sporadically. I don't know how to reproduce it.

Additional info:
Sending SIGTERM to the X process didn't seem to kill it, so I tried SIGQUIT. That eventually worked. I don't know whether the first SIGTERM or the first SIGQUIT would have worked if I'd waited long enough.
Google knows of other reports of similar sounding problem:

http://www.nvnews.net/vbulletin/printthread.php?t=31858&page=6&pp=40
http://my.opera.com/CrazyTerabyte/blog/index.dml/tag/X
http://www.nvnews.net/vbulletin/showthread.php?t=31858

A common feature of those reports is nVidia hardware. This report is no different - I have:

"nVidia Corporation NV37GL [Quadro FX 330/Quadro NVS280]"

I've had the problem with both nv and nVidia drivers, and am currently using the nv driver.
I still have this problem, and I'm seeing it multiple times each day. What can I do to get more information for you to find and fix the problem?
The problem has been reported against Debian as well: http://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg132215.html
Upstream of this bug seems to be https://bugs.freedesktop.org/show_bug.cgi?id=3168 and there is even (untested) workaround:

> A workaround is to disable a single acceleration in xorg.conf:
>
> Option "XaaNoScreenToScreenCopy"
>
> With this option set everything work fine for me, albeit
> scrool is quite slow.
Thanks Matej. From https://bugs.freedesktop.org/show_bug.cgi?id=2155, I see this comment:

> A lot of users have this problem, it's been discussed on
> http://forums.gentoo.org/viewtopic-t-198023-highlight-nvrm.html (350 messages
> and 24924 reads). As you can see in the forums it doesn't matter if you use
> firefox or konqueror, ati or nvidia, 2.4 or 2.6 kernel, etc.
>
> Right before the freeze the following message is written in the log;
>
> NVRM: Xid: 13, 0000 02009700 00002597 00001528 004a016e 00400000

I see "NVRM: Xid" mentioned in multiple places in the threads I've referenced, but I'm not seeing that myself:

bash-3.00$ sudo grep NVRM /var/log/messages*
Password:
bash-3.00$ dmesg | grep NVRM
bash-3.00$

> there is even (untested) workaround:

I could try that, but I'd rather do something more specific to localise the problem. Can you provide instructions for connecting to X with gdb and obtaining a backtrace? Do you have other investigation strategies to recommend?

The comments in https://bugs.freedesktop.org/show_bug.cgi?id=3168 suggest this is an AGP problem. The frequency with which I'm seeing it does seem to vary from week to week, and one thing which might be different is which kernel version I am running. I'm also running vmware much of the time, but I can't say that there is any correlation with whether vmware is running.
"NVRM" is the nvidia resource manager, which is part of the 'nvidia' X driver, not the 'nv' X driver. All of the URLs under "Google knows of other reports of similar sounding problem" above are for users with the nvidia X driver, and are basically irrelevant here.
> A workaround is to disable a single acceleration in xorg.conf:
>
> Option "XaaNoScreenToScreenCopy"

Note that SuSE do this by default for some GeForce cards. They mention problems with the nv driver but not the nVidia drivers.

http://suse.mirrors.tds.net/pub/suse/i386/9.3/docu/RELEASE-NOTES.en.html
Right, but they mentioned that for SuSE-9.3. That bug was fixed about a year ago in the nv driver.
I've had another lockup. Attaching with gdb doesn't tell me a lot:

0x098cc3a8 in ?? ()
(gdb) bt
#0 0x098cc3a8 in ?? ()
#1 0x095f8858 in ?? ()
#2 0x099280c8 in ?? ()
#3 0xbfffc138 in ?? ()
#4 0x09935335 in ?? ()
#5 0x095f8858 in ?? ()
#6 0x09b7d900 in ?? ()
#7 0x09f1cff8 in ?? ()
#8 0x098fc718 in ?? ()
#9 0x00000000 in ?? ()
(gdb)

I've also found another problem with the current configuration. Ctrl-Alt-Fn causes an immediate apparent lockup of X. The mouse cursor no longer moves. strace shows that stuff is still happening:

select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 949000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sigreturn() = ? (mask now [IO])
gettimeofday({1162482444, 508321}, NULL) = 0
gettimeofday({1162482444, 508373}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 938000}) = 1 (in [19], left {45, 914000})
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
gettimeofday({1162482444, 532563}, NULL) = 0
read(19, "5\20\4\0\370\2\200\1&\0\200\1\26\0.\0;\3\5\0(\0\200\1\0"..., 4096) = 264
read(19, 0x9ffa0b8, 4096) = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1162482444, 532844}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 914000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sigreturn() = ? (mask now [IO])
gettimeofday({1162482444, 553291}, NULL) = 0
gettimeofday({1162482444, 553341}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 893000}) = 1 (in [19], left {45, 868000})
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
gettimeofday({1162482444, 578431}, NULL) = 0
read(19, "5\20\4\0\372\2\200\1&\0\200\1\26\0.\0;\3\5\0(\0\200\1\0"..., 4096) = 264
read(19, 0x9ffa0b8, 4096) = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1162482444, 578682}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 868000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sigreturn() = ? (mask now [IO])
gettimeofday({1162482444, 599287}, NULL) = 0
gettimeofday({1162482444, 599338}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 847000}) = 1 (in [19], left {45, 822000})
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
gettimeofday({1162482444, 624430}, NULL) = 0
read(19, "5\20\4\0\374\2\200\1&\0\200\1\26\0.\0;\3\5\0(\0\200\1\0"..., 4096) = 264
...
> I don't know whether the first SIGTERM or the first
> SIGQUIT would have worked if I'd waited long enough.

A single SIGQUIT works to kill the errant X process. SIGINT and SIGTERM do not.
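One possible explanation for SIGQUIT working where SIGINT and SIGTERM do not, offered as a sketch rather than a diagnosis: if a process installs handlers for SIGTERM and SIGINT that only record a shutdown request, and the code it is stuck in never checks that request, those signals have no visible effect, while an unhandled SIGQUIT still gets its default action. Whether the X server behaves this way is an assumption here, not something established in this report. The handler name `give_up` and the flag `stop_requested` are made up for the demonstration, which uses a small child process in place of X.

```python
import signal
import subprocess
import sys
import time

# Child program: handlers for SIGTERM/SIGINT only set a flag (hypothetical
# names), and the busy loop never looks at it. SIGQUIT keeps its default action.
CHILD = r"""
import signal
stop_requested = False
def give_up(signum, frame):
    global stop_requested
    stop_requested = True
signal.signal(signal.SIGTERM, give_up)
signal.signal(signal.SIGINT, give_up)
signal.signal(signal.SIGQUIT, signal.SIG_DFL)
print("ready", flush=True)
while True:
    pass
"""

def term_then_quit():
    """Return (survived SIGTERM?, exit status after SIGQUIT) for the stuck child."""
    child = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    try:
        child.stdout.readline()            # wait until the handlers are installed
        child.send_signal(signal.SIGTERM)  # recorded by the handler, never acted on
        time.sleep(0.5)
        survived_term = child.poll() is None
        child.send_signal(signal.SIGQUIT)  # default action terminates the process
        status = child.wait(timeout=5)
        return survived_term, status
    finally:
        if child.poll() is None:
            child.kill()
        child.stdout.close()

if __name__ == "__main__":
    print(term_then_quit())
```

A negative exit status from `subprocess` means the child was terminated by that signal number, so the second element identifies SIGQUIT as what ended the process.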
(In reply to comment #7)
> Right, but they mentioned that for SuSE-9.3. That bug was fixed about a year
> ago in the nv driver.

I'll have to take your word for that, since there is no bug reference. All I see is the same symptoms mentioned, and the same workaround. Looks like there may have been two nv bugs with similar symptoms, one fixed a year ago, and one still extant. Is that what you suspect? Do you suspect that the driver is scheduling alarms with 0 timeout?
> A workaround is to disable a single acceleration in xorg.conf:
>
> Option "XaaNoScreenToScreenCopy"

http://www.nvnews.net/vbulletin/showthread.php?p=1610 suggests a workaround with a less drastic performance downside: 'add Option "XaaNoOffscreenPixmaps"'.
In reply to comment 4: The official instructions for X debugging are at http://xorg.freedesktop.org/wiki/DebuggingTheXserver

However, if the X-server crashes (not just freezes) then it dumps backtrace to /var/log/Xorg.0.log, which would be really useful if you send to us. You could try running killall -SEGV X from an ssh session; that may persuade the X server to generate a backtrace.
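For the gdb route, here is a minimal sketch of grabbing a backtrace non-interactively from an ssh session. It assumes a gdb new enough to accept `-p`, `-batch` and `-ex` (older releases may not), and the helper names are invented for illustration; the wiki page above remains the authoritative procedure. Without debug symbols the frames will still show up as `?? ()`.

```python
import subprocess
import sys

def gdb_backtrace_argv(pid):
    """Build a gdb command line that attaches to pid, prints a backtrace, then exits."""
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "bt"]

def dump_backtrace(pid):
    """Run gdb against a live process and return everything it printed.

    Attaching to the X server normally requires root.
    """
    done = subprocess.run(gdb_backtrace_argv(pid), capture_output=True, text=True)
    return done.stdout + done.stderr

if __name__ == "__main__":
    # Usage: python this_script.py <pid of the hung X server>
    print(dump_backtrace(sys.argv[1]))
```

Because gdb exits at the end of batch mode, the target is detached again and keeps running in whatever state it was in.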
> However, if the X-server crashes (not just freezes) then it dumps backtrace to
> /var/log/Xorg.0.log, which would be really useful if you send to us.

It doesn't crash. I doubt that -SEGV will help; the problem, I believe, is a lack of symbols. I'm rebuilding from source rpm so that I have debuginfo packages available.
> I'm rebuilding from source rpm so that I have debuginfo packages available.

Hmmm, there aren't any built. Here's why, from the spec file:

# xorg-x11 6.7.0, 6.8.0, 6.8.1
%define parallel_build 0
%define verbose_build 1
# Builds X with debug symbol info. This results in *FRECKIN* *HUGE* packages.
# Enable this if you want to debug the modular X server. As of gdb 5.1.1, this
# requires patches to gdb to be completely useful. I MEAN HUGE, ie 2-3 times
# the size! Don't complain - you've been warned! gdb for X is
# available for download from ftp://people.redhat.com/mharris/gdb-xfree86
%define DebuggableBuild 0
# If enabled, this makes the libraries contain debug info. In the future I
# would like to either have xorg-x11 debuginfoized or build both sets of libs.
%define with_debuglibs 0
# Pass -save-temps to gcc for debugging purposes
%define with_savetemps 0
You won't get debuginfo packages from any XFree86 build or monolithic X.Org build in RHEL or Fedora Core. Only the modularized X.Org server shipping in Fedora Core 5 and later has debuginfo integration and direct gdb debugging available. Debugging the X server on older releases is much, much more of a chore.
> Running strace on the X process shows it to be in a SIGALRM loop with
> 0 timeout:

Does this indicate a kernel bug? What is generating this apparently infinite set of alarm signals? The server certainly doesn't appear to be asking for another alarm in its signal handler.
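One reading of the traces that would not need a kernel bug, offered only as a hypothesis: the strace output from the Ctrl-Alt-Fn lockup earlier in this report shows the server calling setitimer(ITIMER_REAL, ...) with it_interval={0, 20000}, which is a repeating 20 ms timer. A repeating timer keeps delivering SIGALRM without the handler ever re-arming it, so a process that is stuck spinning after arming one would show exactly a stream of SIGALRM and sigreturn pairs. The sketch below demonstrates that behaviour in isolation; the function name and the durations are illustrative, and nothing here confirms that this is what happens inside X.

```python
import signal
import time

def count_periodic_alarms(interval_s=0.02, busy_s=0.3):
    """Arm ITIMER_REAL once with a repeat interval, spin, and count SIGALRMs."""
    fired = 0

    def on_alarm(signum, frame):
        nonlocal fired
        fired += 1  # note: the handler never re-arms the timer

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    # First expiry after interval_s, then again every interval_s.
    signal.setitimer(signal.ITIMER_REAL, interval_s, interval_s)
    try:
        deadline = time.monotonic() + busy_s
        while time.monotonic() < deadline:
            pass  # stand-in for code that is stuck in a busy loop
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0, 0)  # disarm
        signal.signal(signal.SIGALRM, old_handler)
    return fired

if __name__ == "__main__":
    print(count_periodic_alarms())
```

With the defaults the count should be well above one, even though the timer was armed exactly once. Timestamps on the original strace (for example from its `-tt` option) would show whether the alarms in the hung server arrive about 20 ms apart, which would support or rule out this reading.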
I've hit the same symptoms several times. Another co-worker has hit the same problem and I was able to grab the following data from his workstation.

strace output:

--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe) = 28144
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe) = 28144
--- SIGALRM (Alarm clock) @ 0 (0) ---

gdb backtrace:

(gdb) bt
#0 0x00002aaaac047c49 in NVSync () from /usr/lib64/xorg/modules/drivers/nv_drv.so
#1 0x00002aaaae140bb6 in XAAGetFallbackOps () from /usr/lib64/xorg/modules/libxaa.so
#2 0x0000000000512b04 in miInitializeCompositeWrapper ()
#3 0x000000000050e683 in DamageDamageRegion ()
#4 0x00000000004480ae in ProcCopyArea ()
#5 0x0000000000449c9a in Dispatch ()
#6 0x00000000004325d5 in main ()

xorg.conf:

# Xorg configuration created by system-config-display

Section "ServerLayout"
        Identifier "single head configuration"
        Screen 0 "Screen0" 0 0
        InputDevice "Keyboard0" "CoreKeyboard"
EndSection

Section "InputDevice"
        Identifier "Keyboard0"
        Driver "kbd"
        Option "XkbModel" "pc105"
        Option "XkbLayout" "us"
EndSection

Section "Device"
        Identifier "Videocard0"
        Driver "nv"
EndSection

Section "Screen"
        Identifier "Screen0"
        Device "Videocard0"
        DefaultDepth 24
        SubSection "Display"
                Viewport 0 0
                Depth 24
        EndSubSection
EndSection
I should also add that we've hit this running on RHEL 5.0.

xorg-x11-server-Xorg-1.1.1-48.2.el5
xorg-x11-drv-nv-1.2.0-4.fc6
(In reply to comment #16)
> > Running strace on the X process shows it to be in a SIGALRM loop with
> > 0 timeout:
>
> Does this indicate a kernel bug? What is generating this apparently infinite
> set of alarm signals? The server certainly doesn't appear to be asking
> for another alarm in its signal handler.

Not only that, but no other process is being scheduled to run often enough to request an ALARM signal be sent to the X server. Someone needs to start looking at what is happening in the kernel (unless strace is lying about what is happening in the X server process).
I don't know why I haven't already asked for the /var/log/Xorg.*.log files to be attached to this bug as separate uncompressed attachments, so I am asking now. Reporter, could you please attach these logs? Thank you.
Created attachment 159345: old Xorg log
Created attachment 159346: Xorg log
Actually, I think this is a duplicate of the famous NVSync bug. Sorry, there is not much we can do about it: it is widely known across all distributions shipping Xorg, and nobody has found a real solution for it in the couple of years it has been around.

*** This bug has been marked as a duplicate of 219377 ***