120679 – X stuck in 'lightning' screensaver

Summary: X stuck in 'lightning' screensaver

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	John Dennis
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-04-12 22:13 UTC by Michal Jaegermann
Modified:	2007-11-30 22:10 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-04-16 05:44:57 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
An output from "Show State" on a lightning stricken machine (44.00 KB, text/plain) 2004-04-12 22:15 UTC, Michal Jaegermann
Differences in logs from lock up in X and a normal start (5.49 KB, patch) 2004-04-22 18:09 UTC, Michal Jaegermann

Description Michal Jaegermann 2004-04-12 22:13:35 UTC

Description of problem:


I found my test machine stuck with a static picture from
a "lightning" xscreensaver.  No local access at all but it was
still possible to get there over a network.  'top' had an
interesting display :-)

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
9833 root      25   0 93068  13m  80m R 99.8  2.6  49:41.18 X
.....

'ps' showed 'lightning' in SN state and 'xscreensaver' as S.
Attaching 'strace' to these two processes gave both of them stuck in:
select(4, [3], NULL, NULL, NULL <unfinished ...>
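
For readers unfamiliar with the strace line above: the last argument of select() is the timeout, and NULL means "wait indefinitely". A minimal Python sketch of the same situation follows. It assumes a socket pair as a stand-in for a client's connection to the X server; the function name and the 0.1 s timeout are illustrative and not taken from xscreensaver.

```python
import select
import socket

def wait_for_reply(sock, timeout=None):
    """Return True if sock becomes readable within `timeout` seconds.

    timeout=None mirrors the NULL timeout in the strace output:
    the call blocks until the peer writes something.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)

# Stand-in for the client <-> X server connection (an assumption for illustration).
client, server = socket.socketpair()

# The "server" has not answered, so a bounded wait simply times out.
print(wait_for_reply(client, timeout=0.1))

# wait_for_reply(client) with no timeout would block here indefinitely,
# which matches what the hung client processes are doing.
server.sendall(b"reply")
print(wait_for_reply(client, timeout=0.1))
```

The clients themselves look healthy in this picture: they are waiting for a reply that never arrives.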

With the help of SysRq, Show Regs gives this:

Pid: 9833, comm:                    X
EIP: 0060:[<e0b468ca>] CPU: 0
EIP is at radeon_do_wait_for_fifo+0x23/0x45 [radeon]
 EFLAGS: 00003246    Not tainted  (2.6.5-1.309smp)
EAX: 80116100 EBX: df434400 ECX: 3f4634a4 EDX: 00000000
ESI: 00000308 EDI: 00000040 EBP: 00006444 DS: 007b ES: 007b
CR0: 8005003b CR2: b43c4000 CR3: 1ffafc60 CR4: 000006f0
Call Trace:
 [<e0b468fe>] radeon_do_wait_for_idle+0x12/0x55 [radeon]
 [<e0b4337a>] radeon_ioctl+0xe8/0xf5 [radeon]
 [<e0b47d0f>] radeon_cp_idle+0x0/0x6c [radeon]
 [<c016b7ce>] sys_ioctl+0x24a/0x2b1
 [<c01062f3>] syscall_call+0x7/0xb

An output from "SysRq: Show State" is attached.
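
How the SysRq actions were triggered is not stated in the report. With no local keyboard access, one possibility is to trigger them from a remote root shell, assuming the running kernel exposes /proc/sysrq-trigger. The sketch below is hypothetical helper code: the names `SYSRQ_KEYS` and `sysrq` are made up for illustration, while 'p' (show registers) and 't' (show task state) are the standard SysRq command keys.

```python
# Standard SysRq command keys for the two actions used in this report.
SYSRQ_KEYS = {
    "show_regs": "p",    # "SysRq : Show Regs"
    "show_state": "t",   # "SysRq : Show State"
}

def sysrq(action, trigger_path="/proc/sysrq-trigger"):
    """Write the SysRq key for `action` to the trigger file (needs root).

    trigger_path is a parameter only so that the function can be tried
    against an ordinary file; the output itself goes to the kernel log.
    """
    key = SYSRQ_KEYS[action]
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
    return key
```

After calling `sysrq("show_state")` as root, the result would be read back with 'dmesg' or from /var/log/messages.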

Attempts to kill X or its subprocesses did not have any discernible
effect.  Similarly for 'telinit 3'.  'shutdown -h now' resulted only
in a loss of communication, but the machine was not going down (or at
least not in a hurry :-).  Recovered eventually with the power button.

Version-Release number of selected component (if applicable):
xorg-x11-0.6.6-0.2004_03_30.5
kernel-smp-2.6.5-1.309.i686
xscreensaver-4.14-4

Comment 1 Michal Jaegermann 2004-04-12 22:15:56 UTC

Created attachment 99345
An output from "Show State" on a lightning stricken machine

Comment 2 Michal Jaegermann 2004-04-12 22:20:39 UTC

Should have mentioned that the card is an
"ATI Radeon VE/7000 QY (AGP/PCI)" (ChipID = 0x5159)
in a PCI slot and with 64 megs of video memory.

Comment 3 Mike A. Harris 2004-04-13 20:24:14 UTC

Reassigning to kernel component...

Comment 4 Michal Jaegermann 2004-04-14 03:11:37 UTC

Got another one today with the same content of 'Show Regs'.  Here
is a call trace:

  [<e0b468fe>] radeon_do_wait_for_idle+0x12/0x55 [radeon]
  [<e0b4337a>] radeon_ioctl+0xe8/0xf5 [radeon]
  [<c02cbb45>] schedule_timeout+0x13/0xae
  [<c01eff0e>] read_chan+0x389/0x960
  [<c011d2a9>] scheduler_tick+0x56e/0x576
  [<c011d2b1>] default_wake_function+0x0/0xc
  [<c011d2b1>] default_wake_function+0x0/0xc
  [<c01eb096>] tty_read+0xf7/0x177
  [<c01599e0>] vfs_read+0xb8/0xe4
  [<c0159bb9>] sys_read+0x2c/0x42
  [<c01062f3>] syscall_call+0x7/0xb

The only difference is that this time a screensaver was 'wander'
instead of 'lightning'.

I will try later to see if changing kernel versions makes a difference.
A troublesome aspect is that the problem hits "at random".

Comment 5 John Dennis 2004-04-14 16:50:05 UTC

I can take this one, it's not really a kernel issue even though it's in
a kernel module. I've tracked down other hangs like this previously
with DRM and radeon.

If possible it would help if you could get a stack trace of the screen
saver when it hangs.

Comment 6 Michal Jaegermann 2004-04-14 21:21:06 UTC

The X server died on me once again, obviously taking away my console,
but I am not sure if this is the same thing as before.

This time the monitor displays a big "built-in" alert

      OUT OF RANGE
     Hf: 28KHz-70KHz
     Vf: 40Hz-120Hz
      18.0Khz/ 22Hz

The last line supposedly gives parameters of a signal fed to a monitor
right now.

There are similarities.  'top' reports:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1248 root      25   0 88476 9428  79m R 99.9  1.8  31:22.62 X       

and in 'ps' one can find

 4157  0.0  0.4  5324 2216 ?   S    14:11   0:00 xscreensaver -nosplash
 4228  0.0  0.2  3720 1324 ?   SN   14:51   0:00  \_ mountain -root

OTOH "Show Regs" gives another location:

SysRq : Show Regs

Pid: 1248, comm:                    X
EIP: 0060:[<c01127cf>] CPU: 0
EIP is at delay_pmtmr+0xb/0x13
 EFLAGS: 00003283    Not tainted  (2.6.5-1.309smp)
EAX: 42414ef5 EBX: 000003e8 ECX: 42414bb1 EDX: 00000b3d
ESI: 00000000 EDI: 00006daa EBP: 00000020 DS: 007b ES: 007b
CR0: 8005003b CR2: b6077000 CR3: 1ffaf860 CR4: 000006f0
Call Trace:
 [<c01bc421>] __delay+0x9/0xa
 [<e0b47ec5>] radeon_freelist_get+0xd6/0x10b [radeon]
 [<e0b47fdb>] radeon_cp_get_buffers+0x3b/0xe5 [radeon]
 [<e0b481d5>] radeon_cp_buffers+0x150/0x1a7 [radeon]
 [<c0286429>] rt_garbage_collect+0x5b/0x2e9
 [<e0b4337a>] radeon_ioctl+0xe8/0xf5 [radeon]
 [<e0b48085>] radeon_cp_buffers+0x0/0x1a7 [radeon]
 [<c0286429>] rt_garbage_collect+0x5b/0x2e9
 [<c016b7ce>] sys_ioctl+0x24a/0x2b1
 [<c01062f3>] syscall_call+0x7/0xb
 [<c0286429>] rt_garbage_collect+0x5b/0x2e9


Here is what I can get from backtrace for 'xscreensaver'

(gdb) bt
#0  0x003b07a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0047bfcd in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x009c2322 in _XPollfdCacheDel () from /usr/X11R6/lib/libX11.so.6
#3  0x009c3241 in _XRead () from /usr/X11R6/lib/libX11.so.6
#4  0x009c3db3 in _XReply () from /usr/X11R6/lib/libX11.so.6
#5  0x009ba3eb in XQueryTree () from /usr/X11R6/lib/libX11.so.6
#6  0x080533c7 in ?? ()
#7  0x08a65ee0 in ?? ()
#8  0x00a0003b in _XlcAddCT () from /usr/X11R6/lib/libX11.so.6
#9  0x08053583 in ?? ()
#10 0xbff10da0 in ?? ()
#11 0x00a0003b in _XlcAddCT () from /usr/X11R6/lib/libX11.so.6
#12 0x008d0f88 in _XtRemoveAllInputs () from /usr/X11R6/lib/libXt.so.6
#13 0x008d1311 in XtAppNextEvent () from /usr/X11R6/lib/libXt.so.6
#14 0x08054128 in ?? ()
#15 0x08a61270 in ?? ()
#16 0xbff10c40 in ?? ()
#17 0x00000000 in ?? ()

and this is for 'xscreensaver/mountain'

(gdb) bt
#0  0x003b07a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0047bfcd in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x009c2322 in _XPollfdCacheDel () from /usr/X11R6/lib/libX11.so.6
#3  0x009c3241 in _XRead () from /usr/X11R6/lib/libX11.so.6
#4  0x009c3db3 in _XReply () from /usr/X11R6/lib/libX11.so.6
#5  0x009bf224 in XSync () from /usr/X11R6/lib/libX11.so.6
#6  0x0804bd7d in ?? ()
#7  0x08fb8760 in ?? ()
#8  0x00000000 in ?? ()

(I am behind a not too speedy DSL connection and I do not have
debugging packages installed, but all X stuff is still at the
xorg-x11-0.6.6-0.2004_03_30.5 level with the 2.6.5-1.309smp kernel).
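
Backtraces like the two above can also be collected non-interactively, which is convenient over a slow ssh link. The sketch below assumes gdb is installed on the affected machine and uses its standard `-p`, `-batch` and `-ex` options; the helper names are hypothetical. Without debugging packages, frames will still show up as `??`, as they do here.

```python
import subprocess

def gdb_backtrace_cmd(pid):
    """Build a gdb command line that attaches to `pid`, prints 'bt' and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]

def collect_backtrace(pid):
    """Run gdb against a live process and return its output as text.

    Attaching requires sufficient privileges (typically root or the
    owner of the target process).
    """
    result = subprocess.run(
        gdb_backtrace_cmd(pid), capture_output=True, text=True
    )
    return result.stdout
```

For the processes listed in this comment that would be, for example, `collect_backtrace(4157)` for xscreensaver and `collect_backtrace(4228)` for 'mountain'.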

Comment 7 John Dennis 2004-04-15 22:00:39 UTC

I haven't succeeded in reproducing this yet. How long does it usually
take before the problem manifests itself?

Comment 8 Michal Jaegermann 2004-04-15 22:21:30 UTC

Hard to tell.  I usually have to leave the whole thing for a few
hours, but there were occasions when this happened in under an hour.
I wonder if SMP is not relevant here.
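
Since the hang takes an unpredictable time to appear, one option is to leave a small watcher running in an ssh session and note when X becomes pegged. This is only a sketch: the function names, the 90% threshold and the sample text are assumptions for illustration, and it relies on the `ps -eo pid,stat,pcpu,comm` output format. Note that the %CPU figure from 'ps' is averaged over the process lifetime, so it reacts more slowly than 'top'.

```python
import subprocess

def find_spinning(ps_output, name="X", threshold=90.0):
    """Return (pid, pcpu) for runnable processes called `name` above `threshold`."""
    hits = []
    for line in ps_output.splitlines()[1:]:        # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, stat, pcpu, comm = fields
        if comm.strip() == name and stat.startswith("R") and float(pcpu) >= threshold:
            hits.append((int(pid), float(pcpu)))
    return hits

def snapshot():
    """Take one 'ps' snapshot in the column layout find_spinning() expects."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,pcpu,comm"], capture_output=True, text=True
    )
    return result.stdout

# Made-up sample in the same column layout, for trying the parser offline.
SAMPLE = """  PID STAT %CPU COMMAND
 1194 R    97.2 X
 1226 S     0.2 gdmgreeter
"""
print(find_spinning(SAMPLE))
```

Calling `find_spinning(snapshot())` from a loop with a sleep in between, and logging a timestamp on the first hit, would at least narrow down when a lockup started.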

Comment 9 Michal Jaegermann 2004-04-15 23:24:38 UTC

Hm, got another one but this time no radeon is evident anywhere.

SysRq : Show Regs

Pid: 0, comm:              swapper
EIP: 0060:[<c0104041>] CPU: 1
EIP is at default_idle+0x29/0x2c
 EFLAGS: 00000246    Not tainted  (2.6.5-1.309smp)
EAX: 00000000 EBX: 00000000 ECX: c0104018 EDX: dfedc000
ESI: 00000000 EDI: 00000000 EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: 0031e940 CR3: 1ffaf6a0 CR4: 000006f0
Call Trace:
 [<c010409d>] cpu_idle+0x26/0x3b
 [<c01216c2>] call_console_drivers+0xbe/0xe3
 [<c0121926>] printk+0x1dd/0x213

but the effects are similar.  Only network access. In 'ps' output
_everything_ is listed in a state of S or S<something> and
xscreensaver does not have any child listed.  Hard to imagine
that any of the recent updates could have such an effect (it is still
the same xorg-x11 and the same kernel but some other things were
updated). A backtrace from xscreensaver looks exactly like before.

An attempt to 'chvt 1' hangs like that in 'strace':
......
open("/dev/tty", O_RDWR)                = 3
ioctl(3, 0x4b33, 0xbff06f63)            = -1 EINVAL (Invalid argument)
close(3)                                = 0
open("/dev/tty0", O_RDWR)               = 3
ioctl(3, 0x4b33, 0xbff06f63)            = 0
ioctl(3, 0x5606, 0x1)                   = 0
ioctl(3, 0x5607
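
The raw request numbers in this strace output are console ioctls defined in <linux/kd.h> and <linux/vt.h>. A small lookup sketch makes the trace easier to read; the table covers only the three requests seen here, and the helper name `decode` is made up for illustration.

```python
# Request numbers as defined in <linux/kd.h> and <linux/vt.h>.
IOCTL_NAMES = {
    0x4B33: "KDGKBTYPE",      # get keyboard type; commonly used to test "is this fd a console?"
    0x5606: "VT_ACTIVATE",    # request a switch to the given virtual terminal
    0x5607: "VT_WAITACTIVE",  # block until that virtual terminal is active
}

def decode(request):
    """Return the symbolic name for a console ioctl, or its hex value if unknown."""
    return IOCTL_NAMES.get(request, hex(request))

for req in (0x4B33, 0x5606, 0x5607):
    print(hex(req), decode(req))
```

Read this way, 'chvt 1' got as far as requesting the switch (VT_ACTIVATE returned 0) and then stopped inside VT_WAITACTIVE, waiting for a switch that never completes.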

If I try 'telinit 3' then this ends like that:
.....
open("/dev/initctl", O_WRONLY)          = 3
write(3, "i\31\t\3\1\0\0\0003\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 384) = 384
close(3)                                = 0
alarm(0)                                = 3
exit_group(0)                           = ?

and nothing visible happens.  X is still in an output of 'ps'.

Comment 10 John Dennis 2004-04-16 14:24:49 UTC

FWIW, with a very similar configuration it ran overnight without any
problems. I'm sure you already checked this but were there any
suspicious messages in either /var/log/messages or
/var/log/XFree86.0.log? BTW, the name of the X log file has changed
recently to Xorg.0.log but I'm pretty sure the rpm you have installed
is still using the old name.

When it was hung like this were you able to ssh in and do an strace of
the Xorg process?

Comment 11 Michal Jaegermann 2004-04-16 15:56:10 UTC

> I'm sure you already checked this but were there any
> suspicious messages in either /var/log/messages or
> /var/log/XFree86.0.log?

I am afraid that I could not find anything suspicious/unusual.

> BTW, the name of the X log file has changed
> recently to Xorg.0.log

Indeed, but so far I have kept xorg-x11-0.6.6-0.2004_03_30.5 so as
not to change the environment.  I may try to update that and the kernel
to the most recent ones available to me and see what
happens.


> When it was hung like this were you able to ssh in and
> do an strace of the Xorg process?

I mentioned that in my original report. 'strace' shows it sitting
in select, like that:
select(4, [3], NULL, NULL, NULL
and that call never terminates. 'xscreensaver' process indeed
was showing the same.

BTW - this particular box is an SMP Athlon on a Tyan board and
it ran an RH 7.3 installation without any incidents.  I may retest
that just to be sure, but that one is hacky due to missing video
drivers.

Comment 12 Michal Jaegermann 2004-04-17 15:44:08 UTC

It turned out that I have RH9 handy, and I have now been running
the same experiment on the same machine(!) for the past 23 hours
using kernel-smp-2.4.20-30.9.athlon, XFree86-4.3.0-2.90.55 and
xscreensaver-4.07-2.  Nothing bad has happened, nor does it look like
it is going to.  This appears to eliminate the possibility that
the hardware is acting up.

One less obvious thing is that with RH9 I am using kernel.athlon,
while with FC2 there is no such thing, so kernel.i686 comes into
play instead.  No idea if this may be of any relevance.

Comment 13 Michal Jaegermann 2004-04-17 17:22:33 UTC

After switching to xorg-x11-6.7.0-0.5 and kernel-smp-2.6.5-1.326
the first attempt to boot into X died already in gdmgreeter with
an already familiar output from 'top':

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1194 root      25   0 87648 8480  78m R 99.9  1.6   5:33.11 X
.....

The corresponding fragment from 'ps' looked roughly

 1143  0.0  0.4     S    10:31   0:00 /usr/bin/gdm-binary -nodaemon
 1193  0.0  0.5     S    10:31   0:00  \_ /usr/bin/gdm-binary -nodaemon
 1194 97.2  1.6     R    10:31   6:00      \_ /usr/X11R6/bin/X :0 -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp vt7
 1226  0.2  1.7     S    10:31   0:00      \_ /usr/bin/gdmgreeter

The screen was messed up with a garbled strip at the top, a horizontal
line of red and white dashes two thirds of the way down the screen,
running from the left edge roughly to the middle, and none of the normal
greeter elements whatsoever.  'gdmgreeter' sits in a non-returning 'select'
call.  Nothing particular in the logs, and sysrq gives this for a change:

 SysRq : Show Regs
 
 Pid: 1194, comm:                    X
 EIP: 0060:[<02111e0f>] CPU: 1
 EIP is at delay_pmtmr+0xb/0x13
  EFLAGS: 00003283    Not tainted  (2.6.5-1.326smp)
 EAX: def9c74c EBX: 000003e8 ECX: def9c673 EDX: 000000a4
 ESI: 00000000 EDI: 00000000 EBP: 00000020 DS: 007b ES: 007b
 CR0: 8005003b CR2: f6f72000 CR3: 003aa000 CR4: 000006f0
 Call Trace:
  [<021b61e1>] __delay+0x9/0xa
  [<22adde29>] radeon_freelist_get+0xd6/0x10b [radeon]
  [<22addf2a>] radeon_cp_get_buffers+0x26/0x84 [radeon]
  [<22ade084>] radeon_cp_buffers+0xfc/0x124 [radeon]
  [<22ad9748>] radeon_ioctl+0xe8/0xf5 [radeon]
  [<22addf88>] radeon_cp_buffers+0x0/0x124 [radeon]
  [<02167c38>] sys_ioctl+0x23d/0x2a0
  [<021059e8>] sys_sigreturn+0x113/0x138

The second attempt to boot ended up with a fully dead machine the
moment X tried to start.  A blank screen, no network, no keyboard,
nothing but a power switch.

The third attempt resulted, surprisingly, in a normal login and so
far the whole thing runs.  An alert about xkb errors was reported
many times but the keyboard seems to be ok.

My guess would be that something is scribbling over a memory which
does not belong to it and that is why things "move" and some,
seemingly similar, configurations may be "lucky".

Comment 14 Michal Jaegermann 2004-04-18 16:58:03 UTC

If this is a race condition then it seems to be much harder to
trigger with my current setup (i.e. xorg-x11-6.7.0-0.5 and
kernel-smp-2.6.5-1.326 as opposed to xorg-x11-0.6.6-0.2004_03_30.5
and kernel-smp-2.6.5-1.309.i686) than before.  After an initial
"entertainment" I was running a test machine for 24 hours, and it
was not touched at least overnight, and so far no other incident
happened.

If somebody would like to look at results of "SysRq : Show State"
from the last recorded lockup, the one in my current configuration
and described in 'Comment #13', then I have that in my logs.

Comment 15 Michal Jaegermann 2004-04-22 18:08:01 UTC

I again had an incident similar to the one described in comment #13.
The next reboot was clean.  This time I have some differences in
Xorg.0.log.  Most of the time the differences in recorded addresses
are by one bit but not always.  For example:

-(II) RADEON(0): [pci] Ring mapped at 0xf2d8a000
+(II) RADEON(0): [pci] Ring mapped at 0xf2e06000
-(II) RADEON(0): [pci] GART Texture map mapped at 0xf26a9000
+(II) RADEON(0): [pci] GART Texture map mapped at 0xf2725000

and a few more. A full diff is attached.
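
To check how far apart the quoted addresses really are, the two pairs above can be compared both arithmetically and bitwise. This is a minimal sketch using only the four values quoted in this comment; the labels "minus" and "plus" refer to the diff markers, since the comment does not say which log came from the lockup.

```python
# Address pairs copied from the diff excerpt: (line marked '-', line marked '+').
PAIRS = {
    "Ring": (0xf2d8a000, 0xf2e06000),
    "GART Texture map": (0xf26a9000, 0xf2725000),
}

def compare(minus, plus):
    """Return the arithmetic delta, the XOR, and the number of differing bits."""
    xor = minus ^ plus
    return {
        "delta": plus - minus,
        "xor": xor,
        "bits_changed": bin(xor).count("1"),
    }

for name, (minus, plus) in PAIRS.items():
    r = compare(minus, plus)
    print(f"{name}: delta={r['delta']:#x} xor={r['xor']:#x} bits={r['bits_changed']}")
# Ring: delta=0x7c000 xor=0x38c000 bits=5
# GART Texture map: delta=0x7c000 xor=0x18c000 bits=4
```

For these two examples the mappings differ in several bits, but both are shifted by the same amount, 0x7c000 (496 KiB). What that implies for the lockup is not something the excerpt alone can settle.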

Comment 16 Michal Jaegermann 2004-04-22 18:09:42 UTC

Created attachment 99637
Differences in logs from lock up in X and a normal start

Comment 17 Dave Jones 2004-12-07 05:41:00 UTC

Did this get resolved in the 2.6.9-based errata kernel?

Comment 18 Michal Jaegermann 2004-12-22 05:23:30 UTC

I think that this problem may be gone by now.  Somewhat hard
to tell, as it was striking in a rather unpredictable way, but
I did not see anything of that sort recently.

Comment 19 Dave Jones 2005-04-16 05:44:57 UTC

Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.
