Description of problem:

I found my test machine stuck with a static picture from a "lightning" xscreensaver. No local access at all, but it was still possible to get there over a network. 'top' had an interesting display :-)

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  9833 root 25 0 93068 13m 80m R 99.8 2.6 49:41.18 X
  .....

'ps' showed 'lightning' in SN state and 'xscreensaver' as S. Attaching 'strace' to these two processes showed both of them stuck in:

  select(4, [3], NULL, NULL, NULL <unfinished ...>

With the help of SysRq, "Show Regs" gives this:

  Pid: 9833, comm: X
  EIP: 0060:[<e0b468ca>] CPU: 0
  EIP is at radeon_do_wait_for_fifo+0x23/0x45 [radeon]
  EFLAGS: 00003246 Not tainted (2.6.5-1.309smp)
  EAX: 80116100 EBX: df434400 ECX: 3f4634a4 EDX: 00000000
  ESI: 00000308 EDI: 00000040 EBP: 00006444 DS: 007b ES: 007b
  CR0: 8005003b CR2: b43c4000 CR3: 1ffafc60 CR4: 000006f0
  Call Trace:
  [<e0b468fe>] radeon_do_wait_for_idle+0x12/0x55 [radeon]
  [<e0b4337a>] radeon_ioctl+0xe8/0xf5 [radeon]
  [<e0b47d0f>] radeon_cp_idle+0x0/0x6c [radeon]
  [<c016b7ce>] sys_ioctl+0x24a/0x2b1
  [<c01062f3>] syscall_call+0x7/0xb

The output from "SysRq: Show State" is attached.

Attempts to kill X or its subprocesses did not have any discernible effect. Similarly 'telinit 3'. 'shutdown -h now' resulted only in a loss of communication, but the machine was not going down (or at least not in a hurry :-). Recovered eventually with the power button.

Version-Release number of selected component (if applicable):
xorg-x11-0.6.6-0.2004_03_30.5
kernel-smp-2.6.5-1.309.i686
xscreensaver-4.14-4
Created attachment 99345
An output from "Show State" on a lightning stricken machine
I should have mentioned that the card is an "ATI Radeon VE/7000 QY (AGP/PCI)" (ChipID = 0x5159) in a PCI slot, with 64 MB of video memory.
Reassigning to kernel component...
Got another one today with the same content of 'Show Regs'. Here is the call trace:

  [<e0b468fe>] radeon_do_wait_for_idle+0x12/0x55 [radeon]
  [<e0b4337a>] radeon_ioctl+0xe8/0xf5 [radeon]
  [<c02cbb45>] schedule_timeout+0x13/0xae
  [<c01eff0e>] read_chan+0x389/0x960
  [<c011d2a9>] scheduler_tick+0x56e/0x576
  [<c011d2b1>] default_wake_function+0x0/0xc
  [<c011d2b1>] default_wake_function+0x0/0xc
  [<c01eb096>] tty_read+0xf7/0x177
  [<c01599e0>] vfs_read+0xb8/0xe4
  [<c0159bb9>] sys_read+0x2c/0x42
  [<c01062f3>] syscall_call+0x7/0xb

The only difference is that this time the screensaver was 'wander' instead of 'lightning'. I will check later whether changing kernel versions makes a difference. A troublesome aspect is that the problem hits "at random".
I can take this one; it's not really a kernel issue even though it's in a kernel module. I've tracked down other hangs like this previously with DRM and radeon. If possible, it would help if you could get a stack trace of the screen saver when it hangs.
X server died on me once again, taking away my console obviously enough, but I am not sure if this is the same thing as before. This time the monitor displays a big "built-in" alert:

  OUT OF RANGE
  Hf: 28KHz-70KHz
  Vf: 40Hz-120Hz
  18.0Khz/ 22Hz

The last line supposedly gives the parameters of the signal fed to the monitor right now.

There are similarities. 'top' reports:

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  1248 root 25 0 88476 9428 79m R 99.9 1.8 31:22.62 X

and in 'ps' one can find

  4157 0.0 0.4 5324 2216 ? S 14:11 0:00 xscreensaver -nosplash
  4228 0.0 0.2 3720 1324 ? SN 14:51 0:00 \_ mountain -root

OTOH "Show Regs" gives another location:

  SysRq : Show Regs
  Pid: 1248, comm: X
  EIP: 0060:[<c01127cf>] CPU: 0
  EIP is at delay_pmtmr+0xb/0x13
  EFLAGS: 00003283 Not tainted (2.6.5-1.309smp)
  EAX: 42414ef5 EBX: 000003e8 ECX: 42414bb1 EDX: 00000b3d
  ESI: 00000000 EDI: 00006daa EBP: 00000020 DS: 007b ES: 007b
  CR0: 8005003b CR2: b6077000 CR3: 1ffaf860 CR4: 000006f0
  Call Trace:
  [<c01bc421>] __delay+0x9/0xa
  [<e0b47ec5>] radeon_freelist_get+0xd6/0x10b [radeon]
  [<e0b47fdb>] radeon_cp_get_buffers+0x3b/0xe5 [radeon]
  [<e0b481d5>] radeon_cp_buffers+0x150/0x1a7 [radeon]
  [<c0286429>] rt_garbage_collect+0x5b/0x2e9
  [<e0b4337a>] radeon_ioctl+0xe8/0xf5 [radeon]
  [<e0b48085>] radeon_cp_buffers+0x0/0x1a7 [radeon]
  [<c0286429>] rt_garbage_collect+0x5b/0x2e9
  [<c016b7ce>] sys_ioctl+0x24a/0x2b1
  [<c01062f3>] syscall_call+0x7/0xb
  [<c0286429>] rt_garbage_collect+0x5b/0x2e9

Here is what I can get from a backtrace for 'xscreensaver':

  (gdb) bt
  #0 0x003b07a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  #1 0x0047bfcd in ___newselect_nocancel () from /lib/tls/libc.so.6
  #2 0x009c2322 in _XPollfdCacheDel () from /usr/X11R6/lib/libX11.so.6
  #3 0x009c3241 in _XRead () from /usr/X11R6/lib/libX11.so.6
  #4 0x009c3db3 in _XReply () from /usr/X11R6/lib/libX11.so.6
  #5 0x009ba3eb in XQueryTree () from /usr/X11R6/lib/libX11.so.6
  #6 0x080533c7 in ?? ()
  #7 0x08a65ee0 in ?? ()
  #8 0x00a0003b in _XlcAddCT () from /usr/X11R6/lib/libX11.so.6
  #9 0x08053583 in ?? ()
  #10 0xbff10da0 in ?? ()
  #11 0x00a0003b in _XlcAddCT () from /usr/X11R6/lib/libX11.so.6
  #12 0x008d0f88 in _XtRemoveAllInputs () from /usr/X11R6/lib/libXt.so.6
  #13 0x008d1311 in XtAppNextEvent () from /usr/X11R6/lib/libXt.so.6
  #14 0x08054128 in ?? ()
  #15 0x08a61270 in ?? ()
  #16 0xbff10c40 in ?? ()
  #17 0x00000000 in ?? ()

and this is for 'xscreensaver/mountain':

  (gdb) bt
  #0 0x003b07a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  #1 0x0047bfcd in ___newselect_nocancel () from /lib/tls/libc.so.6
  #2 0x009c2322 in _XPollfdCacheDel () from /usr/X11R6/lib/libX11.so.6
  #3 0x009c3241 in _XRead () from /usr/X11R6/lib/libX11.so.6
  #4 0x009c3db3 in _XReply () from /usr/X11R6/lib/libX11.so.6
  #5 0x009bf224 in XSync () from /usr/X11R6/lib/libX11.so.6
  #6 0x0804bd7d in ?? ()
  #7 0x08fb8760 in ?? ()
  #8 0x00000000 in ?? ()

(I am behind a not too speedy DSL connection and I do not have debugging packages installed, but all X stuff is still at the xorg-x11-0.6.6-0.2004_03_30.5 level and the kernel is 2.6.5-1.309smp.)
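As a quick sanity check on the monitor alert quoted in this comment, the reported signal can be compared against the ranges the monitor itself advertises. This is only a small sketch: the numbers are copied from the alert text, while the helper name `in_range` is made up for illustration.

```python
# Values copied from the monitor's on-screen "OUT OF RANGE" alert.
H_RANGE_KHZ = (28.0, 70.0)    # supported horizontal frequency range
V_RANGE_HZ = (40.0, 120.0)    # supported vertical refresh range
measured_h_khz = 18.0         # horizontal frequency reported by the monitor
measured_v_hz = 22.0          # vertical frequency reported by the monitor


def in_range(value, bounds):
    """Return True if value lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


h_ok = in_range(measured_h_khz, H_RANGE_KHZ)
v_ok = in_range(measured_v_hz, V_RANGE_HZ)

print("horizontal in range:", h_ok)
print("vertical in range:  ", v_ok)
```

Both values fall below the supported minimums, so the alert is at least consistent with the video card emitting a signal that no sane mode line would produce.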
I haven't succeeded in reproducing this yet. How long does it usually take before the problem manifests itself?
Hard to tell. I usually have to leave the whole thing for a few hours, but there were occasions when this happened in under an hour. I wonder if SMP is not relevant here.
Hm, got another one, but this time no radeon is evident anywhere.

  SysRq : Show Regs
  Pid: 0, comm: swapper
  EIP: 0060:[<c0104041>] CPU: 1
  EIP is at default_idle+0x29/0x2c
  EFLAGS: 00000246 Not tainted (2.6.5-1.309smp)
  EAX: 00000000 EBX: 00000000 ECX: c0104018 EDX: dfedc000
  ESI: 00000000 EDI: 00000000 EBP: 00000000 DS: 007b ES: 007b
  CR0: 8005003b CR2: 0031e940 CR3: 1ffaf6a0 CR4: 000006f0
  Call Trace:
  [<c010409d>] cpu_idle+0x26/0x3b
  [<c01216c2>] call_console_drivers+0xbe/0xe3
  [<c0121926>] printk+0x1dd/0x213

but the effects are similar. Only network access. In the 'ps' output _everything_ is listed in a state of S or S<something>, and xscreensaver does not have any child listed. Hard to imagine that any of the recent updates could have such an effect (it is still the same xorg-x11 and the same kernel, but some other things were updated).

A backtrace from xscreensaver looks exactly like before. An attempt to 'chvt 1' hangs like this in 'strace':

  ......
  open("/dev/tty", O_RDWR) = 3
  ioctl(3, 0x4b33, 0xbff06f63) = -1 EINVAL (Invalid argument)
  close(3) = 0
  open("/dev/tty0", O_RDWR) = 3
  ioctl(3, 0x4b33, 0xbff06f63) = 0
  ioctl(3, 0x5606, 0x1) = 0
  ioctl(3, 0x5607

If I try 'telinit 3' then it ends like this:

  .....
  open("/dev/initctl", O_WRONLY) = 3
  write(3, "i\31\t\3\1\0\0\0003\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 384) = 384
  close(3) = 0
  alarm(0) = 3
  exit_group(0) = ?

and nothing visible happens. X is still in the output of 'ps'.
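For readers of the 'chvt 1' trace above, the raw ioctl request numbers can be mapped to their symbolic names. The sketch below uses the values defined in the Linux headers linux/kd.h and linux/vt.h (0x4B33 is KDGKBTYPE, 0x5606 is VT_ACTIVATE, 0x5607 is VT_WAITACTIVE); the helper name `decode_ioctl` and the table layout are just illustrative choices.

```python
import re

# Request numbers as defined in <linux/kd.h> and <linux/vt.h>.
KNOWN_IOCTLS = {
    0x4B33: "KDGKBTYPE",      # query keyboard type (used to detect a console fd)
    0x5606: "VT_ACTIVATE",    # request a switch to the given VT
    0x5607: "VT_WAITACTIVE",  # block until the given VT is active
}

# Matches the start of an strace ioctl line, including truncated ones.
IOCTL_RE = re.compile(r"ioctl\((\d+), (0x[0-9a-fA-F]+)")


def decode_ioctl(line):
    """Return the symbolic request name for an strace ioctl line, else None."""
    match = IOCTL_RE.search(line)
    if match is None:
        return None
    return KNOWN_IOCTLS.get(int(match.group(2), 16), "unknown")


# Lines copied from the strace output of 'chvt 1' in this comment.
STRACE_LINES = [
    'open("/dev/tty0", O_RDWR) = 3',
    "ioctl(3, 0x4b33, 0xbff06f63) = 0",
    "ioctl(3, 0x5606, 0x1) = 0",
    "ioctl(3, 0x5607",
]

for line in STRACE_LINES:
    name = decode_ioctl(line)
    if name is not None:
        print(f"{line}  -> {name}")
```

Read this way, 'chvt' successfully asks for VT 1 and then blocks in VT_WAITACTIVE, which would fit a situation where the switch away from the X server's VT never completes.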
FWIW, with a very similar configuration it ran overnight without any problems. I'm sure you already checked this but were there any suspicious messages in either /var/log/messages or /var/log/XFree86.0.log? BTW, the name of the X log file has changed recently to Xorg.0.log but I'm pretty sure the rpm you have installed is still using the old name. When it was hung like this were you able to ssh in and do an strace of the Xorg process?
> I'm sure you already checked this but were there any
> suspicious messages in either /var/log/messages or
> /var/log/XFree86.0.log?

I am afraid that I could not find anything suspicious or unusual.

> BTW, the name of the X log file has changed
> recently to Xorg.0.log

Indeed, but so far I have kept xorg-x11-0.6.6-0.2004_03_30.5 so as not to change the environment. I may try to update that and the kernel to the most recent ones available to me and see what happens.

> When it was hung like this were you able to ssh in and
> do an strace of the Xorg process?

I mentioned that in my original report. 'strace' shows it sitting in select, like this:

  select(4, [3], NULL, NULL, NULL

and that call never terminates. The 'xscreensaver' process was indeed showing the same.

BTW, this particular box is an SMP Athlon on a Tyan board, and it did run an RH 7.3 installation without any incidents. I may retest that just to be sure, but this one is hacky due to missing video drivers.
It turned out that I have RH9 handy, and I have been running the same experiment on the same machine(!) for the past 23 hours using kernel-smp-2.4.20-30.9.athlon, XFree86-4.3.0-2.90.55 and xscreensaver-4.07-2. Nothing bad has happened, nor does it look like anything is going to. This appears to eliminate the possibility that the hardware is acting up. One less obvious thing is that with RH9 I am using kernel.athlon, while with FC2 there is no such thing, so kernel.i686 comes into play instead. No idea if this may be of any relevance.
After switching to xorg-x11-6.7.0-0.5 and kernel-smp-2.6.5-1.326, the first attempt to boot into X died already in gdmgreeter, with an already familiar output from 'top':

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  1194 root 25 0 87648 8480 78m R 99.9 1.6 5:33.11 X
  .....

The corresponding fragment from 'ps' looked roughly like this:

  1143 0.0 0.4 S 10:31 0:00 /usr/bin/gdm-binary -nodaemon
  1193 0.0 0.5 S 10:31 0:00 \_ /usr/bin/gdm-binary -nodaemon
  1194 97.2 1.6 R 10:31 6:00 \_ /usr/X11R6/bin/X :0 -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp vt7
  1226 0.2 1.7 S 10:31 0:00 \_ /usr/bin/gdmgreeter

The screen was messed up, with a garbled strip at the top, a horizontal line of red and white dashes two thirds of the way down the screen running from the left edge roughly to the middle, and none of the normal greeter elements whatsoever. 'gdmgreeter' sits in a non-returning 'select' call. Nothing particular in the logs, and SysRq gives this for a change:

  SysRq : Show Regs
  Pid: 1194, comm: X
  EIP: 0060:[<02111e0f>] CPU: 1
  EIP is at delay_pmtmr+0xb/0x13
  EFLAGS: 00003283 Not tainted (2.6.5-1.326smp)
  EAX: def9c74c EBX: 000003e8 ECX: def9c673 EDX: 000000a4
  ESI: 00000000 EDI: 00000000 EBP: 00000020 DS: 007b ES: 007b
  CR0: 8005003b CR2: f6f72000 CR3: 003aa000 CR4: 000006f0
  Call Trace:
  [<021b61e1>] __delay+0x9/0xa
  [<22adde29>] radeon_freelist_get+0xd6/0x10b [radeon]
  [<22addf2a>] radeon_cp_get_buffers+0x26/0x84 [radeon]
  [<22ade084>] radeon_cp_buffers+0xfc/0x124 [radeon]
  [<22ad9748>] radeon_ioctl+0xe8/0xf5 [radeon]
  [<22addf88>] radeon_cp_buffers+0x0/0x124 [radeon]
  [<02167c38>] sys_ioctl+0x23d/0x2a0
  [<021059e8>] sys_sigreturn+0x113/0x138

The second attempt to boot ended up with a fully dead machine the moment X tried to start. A blank screen, no network, no keyboard, nothing but the power switch.

The third attempt surprisingly resulted in a normal login, and so far the whole thing runs. An alert about xkb errors was reported many times, but the keyboard seems to be ok.
My guess would be that something is scribbling over memory which does not belong to it, and that is why things "move" and some seemingly similar configurations may be "lucky".
If this is a race condition then it seems to be much harder to trigger with my current setup (i.e. xorg-x11-6.7.0-0.5 and kernel-smp-2.6.5-1.326, as opposed to xorg-x11-0.6.6-0.2004_03_30.5 and kernel-smp-2.6.5-1.309.i686) than before. After the initial "entertainment" I have been running the test machine for 24 hours, during which it was left untouched at least overnight, and so far no other incident has happened. If somebody would like to look at the results of "SysRq : Show State" from the last recorded lockup, the one in my current configuration described in 'Comment #13', then I have that in my logs.
I again had an incident similar to the one described in comment #13. The next reboot was clean. This time I have some differences in Xorg.0.log. Most of the time the differences in recorded addresses are by one bit, but not always. For example:

  -(II) RADEON(0): [pci] Ring mapped at 0xf2d8a000
  +(II) RADEON(0): [pci] Ring mapped at 0xf2e06000
  -(II) RADEON(0): [pci] GART Texture map mapped at 0xf26a9000
  +(II) RADEON(0): [pci] GART Texture map mapped at 0xf2725000

and a few more. A full diff is attached.
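To make the comparison of mapped addresses less error-prone than eyeballing hex digits, the two pairs quoted in this comment can be checked mechanically. This is a sketch only: the addresses are copied from the diff excerpt, the function name `compare` is invented here, and nothing is assumed about which side of the diff is the lockup and which the normal start.

```python
# Address pairs copied from the '-' and '+' lines of the quoted diff excerpt.
PAIRS = {
    "Ring": (0xF2D8A000, 0xF2E06000),
    "GART Texture map": (0xF26A9000, 0xF2725000),
}


def compare(old, new):
    """Return (number of differing bits, arithmetic offset new - old)."""
    differing_bits = bin(old ^ new).count("1")
    return differing_bits, new - old


for name, (old, new) in PAIRS.items():
    bits, offset = compare(old, new)
    print(f"{name}: {bits} differing bits, offset {offset:#x}")
```

For these two pairs the addresses differ in several bits rather than one, which matches the "but not always" remark, and both mappings move by the same offset. Whether that means anything for the lockups is not something this check can tell.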
Created attachment 99637
Differences in logs from lock up in X and a normal start
Did this get resolved in the 2.6.9-based errata kernel?
I think that this problem may be gone by now. Somewhat hard to tell, as it was striking in a rather unpredictable way, but I did not see anything of that sort recently.
Fedora Core 2 has now reached end of life, and no further updates will be provided by Red Hat. The Fedora legacy project will be producing further kernel updates for security problems only. If this bug has not been fixed in the latest Fedora Core 2 update kernel, please try to reproduce it under Fedora Core 3, and reopen if necessary, changing the product version accordingly. Thank you.