210584 – X server stops responding to keyboard, but mouse moves

Summary: X server stops responding to keyboard, but mouse moves

Keywords:
Status:	CLOSED DUPLICATE of bug 219377
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	xorg-x11
Sub Component:
Version:	4.4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Adam Jackson
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	248373

Reported:	2006-10-12 22:41 UTC by Charlie Brady
Modified:	2007-11-17 01:14 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-02 20:22:56 UTC
Target Upstream Version:
Embargoed:

Attachments
old Xorg log (42.88 KB, text/plain) 2007-07-16 17:52 UTC, Charlie Brady	no flags	Details
Xorg log (41.89 KB, text/plain) 2007-07-16 17:53 UTC, Charlie Brady	no flags	Details

Description Charlie Brady 2006-10-12 22:41:27 UTC

Description of problem:

I am running CentOS 4 (built from RHEL4 sources) on a Dell workstation. Quite a
few times now I've had a sudden system hang, where nothing appears to work
except that moving the mouse causes the mouse cursor to move. Ctrl-Alt-Fn does
not switch back to a text console, and Ctrl-Alt-Backspace does not kill X.

I am able to access the system over the network, where I find that X is using
nearly all of the available CPU.

bash-3.00$ ps -fww 4802
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root      4802  4766  0 Sep14 ?        R    271:30 /usr/X11R6/bin/X :0 -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp vt7
bash-3.00$

Running strace on the X process shows it to be in a SIGALRM loop with 0 timeout:

...
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
--- SIGALRM (Alarm clock) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
...

Version-Release number of selected component (if applicable):

xorg-x11-6.8.2-1.EL.13.37.2

How reproducible:

Happens sporadically. I don't know how to reproduce it.

Additional info:

Sending SIGTERM to the X process didn't seem to kill it, so I tried SIGQUIT.
That eventually worked. I don't know whether the first SIGTERM or the first
SIGQUIT would have worked if I'd waited long enough.

Google knows of other reports of similar-sounding problems:

http://www.nvnews.net/vbulletin/printthread.php?t=31858&page=6&pp=40
http://my.opera.com/CrazyTerabyte/blog/index.dml/tag/X
http://www.nvnews.net/vbulletin/showthread.php?t=31858

A common feature of those reports is nVidia hardware. This report is no
different - I have:

"nVidia Corporation NV37GL [Quadro FX 330/Quadro NVS280]"

I've had the problem with both nv and nVidia drivers, and am currently using the
nv driver.

Comment 1 Charlie Brady 2006-11-01 17:08:51 UTC

I still have this problem, and I'm seeing it multiple times each day. What can I
do to get more information for you to find and fix the problem?

Comment 2 Charlie Brady 2006-11-01 18:44:17 UTC

The problem has been reported against Debian as well:

http://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg132215.html

Comment 3 Matěj Cepl 2006-11-02 10:53:59 UTC

Upstream of this bug seems to be https://bugs.freedesktop.org/show_bug.cgi?id=3168
and there is even an (untested) workaround:

> A workaround is to disable a single acceleration in xorg.conf:
> 
> Option "XaaNoScreenToScreenCopy"
> 
> With this option set everything works fine for me, albeit
> scrolling is quite slow.

Comment 4 Charlie Brady 2006-11-02 15:27:24 UTC

Thanks Matej. From https://bugs.freedesktop.org/show_bug.cgi?id=2155, I see this
comment:

> A lot of users have this problem, it's been discussed on
> http://forums.gentoo.org/viewtopic-t-198023-highlight-nvrm.html (350 messages
> and 24924 reads). As you can see in the forums it doesn't matter if you use
> firefox or konqueror, ati or nvidia, 2.4 or 2.6 kernel, etc.
>
> Right before the freeze the following message is written in the log;
>
> NVRM: Xid: 13, 0000 02009700 00002597 00001528 004a016e 00400000

I see "NVRM: Xid" mentioned in multiple places in the threads I've referenced,
but I'm not seeing that myself:

bash-3.00$ sudo grep NVRM /var/log/messages*
Password:
bash-3.00$ dmesg | grep NVRM
bash-3.00$

> there is even an (untested) workaround:

I could try that, but I'd rather do something more specific to localise the
problem. Can you provide instructions for attaching to X with gdb and
obtaining a backtrace? Do you have other investigation strategies to recommend?

The comments in https://bugs.freedesktop.org/show_bug.cgi?id=3168 suggest this
is an AGP problem. The frequency with which I'm seeing it does seem to vary from
week to week, and one thing which might be different is which kernel version I
am running. I'm also running vmware much of the time - I can't say that there is
any correlation with/without vmware running.

Comment 5 Lonni J Friedman 2006-11-02 15:55:49 UTC

"NVRM" is the nvidia resource manager, which is part of the 'nvidia' X driver,
not the 'nv' X driver.  

All of the URLs listed in the description as other reports of similar problems
are for users with the nvidia X driver, and are basically irrelevant here.

Comment 6 Charlie Brady 2006-11-02 15:57:23 UTC

> A workaround is to disable a single acceleration in xorg.conf:
> 
> Option "XaaNoScreenToScreenCopy"

Note that Suse do this by default for some GeForce cards. They mention problems
with the nv driver but not the nVidia drivers.
http://suse.mirrors.tds.net/pub/suse/i386/9.3/docu/RELEASE-NOTES.en.html

Comment 7 Lonni J Friedman 2006-11-02 16:00:15 UTC

Right, but they mentioned that for SuSE-9.3.  That bug was fixed about a year
ago in the nv driver.

Comment 8 Charlie Brady 2006-11-02 16:06:43 UTC

I've had another lockup - attaching with gdb doesn't tell me a lot:


0x098cc3a8 in ?? ()
(gdb) bt
#0  0x098cc3a8 in ?? ()
#1  0x095f8858 in ?? ()
#2  0x099280c8 in ?? ()
#3  0xbfffc138 in ?? ()
#4  0x09935335 in ?? ()
#5  0x095f8858 in ?? ()
#6  0x09b7d900 in ?? ()
#7  0x09f1cff8 in ?? ()
#8  0x098fc718 in ?? ()
#9  0x00000000 in ?? ()
(gdb)

I've also found another problem with the current configuration. Ctrl-Alt-Fn
causes an immediate apparent lockup of X. The mouse cursor no longer moves.
strace shows that stuff is still happening:

select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 949000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sigreturn()                             = ? (mask now [IO])
gettimeofday({1162482444, 508321}, NULL) = 0
gettimeofday({1162482444, 508373}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 938000}) = 1 (in [19], left {45, 914000})
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
gettimeofday({1162482444, 532563}, NULL) = 0
read(19, "5\20\4\0\370\2\200\1&\0\200\1\26\0.\0;\3\5\0(\0\200\1\0"..., 4096) = 264
read(19, 0x9ffa0b8, 4096)               = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1162482444, 532844}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 914000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sigreturn()                             = ? (mask now [IO])
gettimeofday({1162482444, 553291}, NULL) = 0
gettimeofday({1162482444, 553341}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 893000}) = 1 (in [19], left {45, 868000})
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
gettimeofday({1162482444, 578431}, NULL) = 0
read(19, "5\20\4\0\372\2\200\1&\0\200\1\26\0.\0;\3\5\0(\0\200\1\0"..., 4096) = 264
read(19, 0x9ffa0b8, 4096)               = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1162482444, 578682}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 868000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sigreturn()                             = ? (mask now [IO])
gettimeofday({1162482444, 599287}, NULL) = 0
gettimeofday({1162482444, 599338}, NULL) = 0
select(256, [1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23], NULL, NULL, {45, 847000}) = 1 (in [19], left {45, 822000})
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
gettimeofday({1162482444, 624430}, NULL) = 0
read(19, "5\20\4\0\374\2\200\1&\0\200\1\26\0.\0;\3\5\0(\0\200\1\0"..., 4096) = 264
...
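
The two traces in this report differ in a way that can be checked mechanically: the hung server in the description shows only SIGALRM deliveries followed by sigreturn, while the trace above still contains select, read and setitimer calls. Below is a minimal Python sketch, assuming the strace output has been saved as lines of text. The function names are made up for illustration, and the "signals but no other system calls" test is only a rough heuristic, not an established diagnostic.

```python
import re
from collections import Counter

# Signal delivery lines look like "--- SIGALRM (Alarm clock) @ 0 (0) ---".
SIGNAL_RE = re.compile(r"^--- (SIG[A-Z0-9]+)\b")
# System call lines start with the call name and an opening parenthesis.
SYSCALL_RE = re.compile(r"^([a-z_][a-z0-9_]*)\(")

# Returning from a signal handler is not evidence of useful work.
SIGNAL_RETURN_CALLS = {"sigreturn", "rt_sigreturn"}


def summarize_strace(lines):
    """Count signal deliveries and system calls in strace text lines."""
    signals = Counter()
    syscalls = Counter()
    for raw in lines:
        line = raw.strip()
        match = SIGNAL_RE.match(line)
        if match:
            signals[match.group(1)] += 1
            continue
        match = SYSCALL_RE.match(line)
        if match:
            syscalls[match.group(1)] += 1
        # Wrapped continuation lines match neither pattern and are skipped.
    return signals, syscalls


def looks_like_user_space_spin(lines):
    """True if the trace has signals but no calls other than sigreturn."""
    signals, syscalls = summarize_strace(lines)
    other_calls = set(syscalls) - SIGNAL_RETURN_CALLS
    return bool(signals) and not other_calls


if __name__ == "__main__":
    hung = [
        "--- SIGALRM (Alarm clock) @ 0 (0) ---",
        "sigreturn()                             = ? (mask now [])",
    ] * 3
    print(summarize_strace(hung))
    print(looks_like_user_space_spin(hung))
```

A trace that passes this check is consistent with the server looping in its own code rather than waiting in the kernel, which is also what the high CPU usage in the description suggests.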

Comment 9 Charlie Brady 2006-11-02 16:07:52 UTC

> I don't know whether the first SIGTERM or the first
> SIGQUIT would have worked if I'd waited long enough.

A single SIGQUIT works to kill the errant X process. SIGINT and SIGTERM do not.

Comment 10 Charlie Brady 2006-11-02 16:11:47 UTC

(In reply to comment #7)
> Right, but they mentioned that for SuSE-9.3.  That bug was fixed about a year
> ago in the nv driver.

I'll have to take your word for that, since there is no bug reference. All I see
is the same symptoms mentioned, and the same workaround. Looks like there may
have been two nv bugs with similar symptoms, one fixed a year ago, and one still
extant. Is that what you suspect?

Do you suspect that the driver is scheduling alarms with 0 timeout?

Comment 11 Charlie Brady 2006-11-02 16:18:13 UTC

> A workaround is to disable a single acceleration in xorg.conf:
> 
> Option "XaaNoScreenToScreenCopy"

http://www.nvnews.net/vbulletin/showthread.php?p=1610 suggests a workaround with
less drastic performance downside - 'add Option "XaaNoOffscreenPixmaps"'.

Comment 12 Matěj Cepl 2006-11-02 17:03:57 UTC

To comment 4: the official instructions for X debugging are at
http://xorg.freedesktop.org/wiki/DebuggingTheXserver
However, if the X server crashes (not just freezes), it dumps a backtrace to
/var/log/Xorg.0.log, which would be really useful if you sent it to us. You may
try running killall -SEGV X from an ssh session; that may persuade the X server
to generate a backtrace.

Comment 13 Charlie Brady 2006-11-02 17:07:40 UTC

> However, if the X server crashes (not just freezes), it dumps a backtrace to
> /var/log/Xorg.0.log, which would be really useful if you sent it to us.

It doesn't crash.

I doubt that -SEGV will help - the problem I believe is lack of symbols.

I'm rebuilding from source rpm so that I have debuginfo packages available.

Comment 14 Charlie Brady 2006-11-02 20:09:39 UTC

> I'm rebuilding from source rpm so that I have debuginfo packages available.

Hmmm, there aren't any built. Here's why, from the spec file:

# xorg-x11 6.7.0, 6.8.0, 6.8.1
%define parallel_build          0
%define verbose_build           1
# Builds X with debug symbol info.  This results in *FRECKIN* *HUGE* packages.
# Enable this if you want to debug the modular X server.  As of gdb 5.1.1, this
# requires patches to gdb to be completely useful.  I MEAN HUGE, ie 2-3 times
# the size!  Don't complain - you've been warned!  gdb for X is
# available for download from ftp://people.redhat.com/mharris/gdb-xfree86
%define DebuggableBuild         0
# If enabled, this makes the libraries contain debug info.  In the future I
# would like to either have xorg-x11 debuginfoized or build both sets of libs.
%define with_debuglibs          0
# Pass -save-temps to gcc for debugging purposes
%define with_savetemps          0

Comment 15 Mike A. Harris 2006-12-12 05:59:52 UTC

You won't get debuginfo packages from any XFree86 build or monolithic X.Org
build in RHEL or Fedora Core.  Only the modularized X.Org server shipping
in Fedora Core 5 and later has debuginfo integration and direct gdb
debugging available.

Debugging the X server on older releases is much much more of a chore.

Comment 16 Charlie Brady 2006-12-19 02:39:16 UTC

> Running strace on the X process shows it to be in a SIGALRM loop with
> 0 timeout:

Does this indicate a kernel bug? What is generating this apparently infinite set
of alarm signals? The server certainly doesn't appear to be asking for another
alarm in its signal handler.
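
One possible reading, not confirmed anywhere in this report: a periodic interval timer does not have to be requested again from the signal handler. The trace in comment 8 shows the server calling `setitimer(ITIMER_REAL, ...)` with an `it_interval` of 20000 microseconds, and a timer armed with a non-zero interval keeps firing until it is explicitly disarmed. If the server were spinning in user space without making system calls, strace would then show nothing but the signal deliveries. Here is a small Python sketch of that behaviour (Unix only; the function name and the busy loop are illustrative stand-ins, not X server code):

```python
import signal
import time


def count_alarms_during_busy_loop(interval_s=0.02, duration_s=0.3):
    """Arm a periodic ITIMER_REAL once, spin, and count SIGALRM deliveries."""
    ticks = 0

    def on_alarm(signum, frame):
        # The handler only counts; it never asks for another alarm.
        nonlocal ticks
        ticks += 1

    previous = signal.signal(signal.SIGALRM, on_alarm)
    # First expiry after interval_s, then again every interval_s,
    # re-armed by the kernel without any further request.
    signal.setitimer(signal.ITIMER_REAL, interval_s, interval_s)
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            pass  # busy loop standing in for a hypothetical stuck wait
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(
            signal.SIGALRM,
            previous if previous is not None else signal.SIG_DFL,
        )
    return ticks


if __name__ == "__main__":
    print("SIGALRM deliveries:", count_alarms_during_busy_loop())
```

With a 20 ms interval the count grows for as long as the loop runs, even though the handler never re-arms anything. That would fit the SIGALRM-only trace without requiring a kernel bug, though it does not explain why the server is looping in the first place.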

Comment 17 Nate Straz 2007-07-10 20:45:25 UTC

I've hit the same symptoms several times.  Another co-worker has hit the same
problem and I was able to grab the following data from his workstation.

strace output:

--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe)                       = 28144
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe)                       = 28144
--- SIGALRM (Alarm clock) @ 0 (0) ---

gdb backtrace:
(gdb) bt
#0  0x00002aaaac047c49 in NVSync ()
   from /usr/lib64/xorg/modules/drivers/nv_drv.so
#1  0x00002aaaae140bb6 in XAAGetFallbackOps ()
   from /usr/lib64/xorg/modules/libxaa.so
#2  0x0000000000512b04 in miInitializeCompositeWrapper ()
#3  0x000000000050e683 in DamageDamageRegion ()
#4  0x00000000004480ae in ProcCopyArea ()
#5  0x0000000000449c9a in Dispatch ()
#6  0x00000000004325d5 in main ()

xorg.conf:

# Xorg configuration created by system-config-display

Section "ServerLayout"
        Identifier     "single head configuration"
        Screen      0  "Screen0" 0 0
        InputDevice    "Keyboard0" "CoreKeyboard"
EndSection

Section "InputDevice"
        Identifier  "Keyboard0"
        Driver      "kbd"
        Option      "XkbModel" "pc105"
        Option      "XkbLayout" "us"
EndSection

Section "Device"
        Identifier  "Videocard0"
        Driver      "nv"
EndSection

Section "Screen"
        Identifier "Screen0"
        Device     "Videocard0"
        DefaultDepth     24
        SubSection "Display"
                Viewport   0 0
                Depth     24
        EndSubSection
EndSection

Comment 18 Nate Straz 2007-07-10 20:47:48 UTC

I should also add that we've hit this running on RHEL 5.0.

xorg-x11-server-Xorg-1.1.1-48.2.el5
xorg-x11-drv-nv-1.2.0-4.fc6

Comment 19 Charlie Brady 2007-07-16 15:55:14 UTC

(In reply to comment #16)
> > Running strace on the X process shows it to be in a SIGALRM loop with
> > 0 timeout:
> 
> Does this indicate a kernel bug? What is generating this apparently infinite
> set of alarm signals? The server certainly doesn't appear to be asking
> for another alarm in its signal handler.

Not only that, but no other process is being scheduled to run often enough to
request an ALARM signal be sent to the X server.

Someone needs to start looking at what is happening in the kernel (unless strace
is lying about what is happening in the X server process).

Comment 20 Matěj Cepl 2007-07-16 16:50:51 UTC

I have no idea why I haven't already asked for the /var/log/Xorg.*.log files to
be attached to this bug as separate uncompressed attachments, so I am asking for
that now. Reporter, could you please attach these logs?

Thank you

Comment 21 Charlie Brady 2007-07-16 17:52:34 UTC

Created attachment 159345 [details]
old Xorg log

Comment 22 Charlie Brady 2007-07-16 17:53:10 UTC

Created attachment 159346 [details]
Xorg log

Comment 23 Matěj Cepl 2007-10-02 20:22:56 UTC

Actually, I think this is a duplicate of the famous nvsync bug. Sorry, there is
not much we can do about it: it is widely known across all distributions
shipping Xorg, and nobody has found a real solution for it in the couple of
years it has been around.

*** This bug has been marked as a duplicate of 219377 ***
