283161 – F8/F9 kernel hangs after a day or two of uptime

Summary: F8/F9 kernel hangs after a day or two of uptime

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	9
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-09-07 20:38 UTC by Jonathan Kamens
Modified:	2008-08-12 22:40 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-08-12 22:40:36 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
contents of /proc/interrupts (1.03 KB, text/plain) 2007-10-17 03:38 UTC, Jonathan Kamens	no flags	Details
lspci output (1.91 KB, text/plain) 2007-10-17 03:38 UTC, Jonathan Kamens	no flags	Details
Diff of dmidecodes for nonworking vs. working thinkpad. (3.68 KB, patch) 2007-11-26 09:39 UTC, Pekka Pietikäinen	no flags	Details \| Diff
lspci output - mschout (1.52 KB, text/plain) 2007-12-13 04:05 UTC, mschout	no flags	Details

Description Jonathan Kamens 2007-09-07 20:38:10 UTC

I have been unable to use a Rawhide kernel on my machine in at least a month,
and possibly two.  Every time I try a new Rawhide kernel, it locks up hard after
a day or two of uptime.  I've most recently confirmed this with
2.6.23-0.164.rc5.fc8.

Since the kernel locks up hard, there's nothing in the log file to explain the
crash.

I don't know enough about kernel debugging to know how I can capture useful
information about the hang, but if there's any information I can provide which
will help to track down this problem, just tell me how to collect it and I'll be
happy to do so.

In the meantime, back I go to 2.6.22.5-71.fc7, which is rock solid for me.

Comment 1 Jonathan Kamens 2007-09-19 14:34:00 UTC

Still a problem in kernel-2.6.23-0.184.rc6.git4.fc8.

Comment 2 Chuck Ebbert 2007-09-19 15:25:46 UTC

Which kernel, x86_64 or i686?

Comment 3 Jonathan Kamens 2007-09-19 15:50:09 UTC

i686

Comment 4 Chuck Ebbert 2007-09-19 15:59:46 UTC

Can you try some kernel boot parameters?
(don't mix options from different lines)

First try: nohz=off highres=off
If that doesn't help, try: nolapic_timer
Then: nolapic
And finally: noapic

Also, post the contents of /proc/interrupts and output of the lspci command
(without any of those options in use.)
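
The parameter sets above are meant to be tried one line at a time. As a minimal sketch (not part of the original instructions), the following Python helper checks a kernel command line, such as the contents of /proc/cmdline, and reports which set is active or whether options from different lines have been mixed. The names OPTION_GROUPS, groups_in_use and describe are made up for illustration.

```python
# Each inner tuple is one "line" of options from the comment above;
# options from different tuples should not be combined in one boot.
OPTION_GROUPS = [
    ("nohz=off", "highres=off"),
    ("nolapic_timer",),
    ("nolapic",),
    ("noapic",),
]


def groups_in_use(cmdline):
    """Return indices of OPTION_GROUPS with at least one option on cmdline."""
    tokens = set(cmdline.split())
    return [
        index
        for index, group in enumerate(OPTION_GROUPS)
        if any(option in tokens for option in group)
    ]


def describe(cmdline):
    """Summarise which test option set a kernel command line is using."""
    used = groups_in_use(cmdline)
    if not used:
        return "no test options in use"
    if len(used) > 1:
        return "options from different lines are mixed"
    return "testing: " + " ".join(OPTION_GROUPS[used[0]])
```

On a running system, describe(open("/proc/cmdline").read()) would report on the currently booted kernel.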

Comment 5 Jonathan Kamens 2007-10-14 01:33:31 UTC

With 2.6.23-0.224.rc9.git6.fc8, I tried nohz=off highres=off and nolapic_timer.
 Both hung.

Then I tried nolapic, but this caused my computer to behave in all sorts of
weird ways, in particular, it was unable to acquire a DHCP address from Comcast
on boot (tried it twice), and I got a kernel oops like this:

Oct 13 21:17:12 jik2 kernel: BUG: unable to handle kernel NULL pointer
dereference at virtual address 00000000
Oct 13 21:17:12 jik2 kernel: printing eip: f894cb6c *pde = 6c926067 
Oct 13 21:17:12 jik2 kernel: Oops: 0000 [#1] SMP 
Oct 13 21:17:12 jik2 kernel: Modules linked in: sit tunnel4 pcspkr appletalk
fuse w83627hf hwmon_vid hwmon eeprom tun ipv6 nf_conntrack nfnetlink ext2
dm_mirror dm_mod floppy pwc snd_usb_audio snd_usb_lib snd_hwdep cx88_alsa
snd_via82xx gameport snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_pcm_oss tuner snd_mixer_oss snd_pcm
snd_mpu401_uart snd_rawmidi cx8800 snd_timer snd_seq_device snd_page_alloc
parport_pc cx88xx snd parport 8139too ir_common soundcore via_rhine i2c_algo_bit
i2c_viapro 8139cp video_buf tveeprom mii i2c_core videodev v4l1_compat
compat_ioctl32 v4l2_common btcx_risc button sg sr_mod cdrom sata_via pata_via
ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Oct 13 21:17:12 jik2 kernel: CPU:    0
Oct 13 21:17:12 jik2 kernel: EIP:    0060:[<f894cb6c>]    Not tainted VLI
Oct 13 21:17:12 jik2 kernel: EFLAGS: 00210093   (2.6.23-0.224.rc9.git6.fc8 #1)
Oct 13 21:17:12 jik2 kernel: EIP is at rhine_interrupt+0x186/0x6b8 [via_rhine]
Oct 13 21:17:12 jik2 kernel: eax: 00000000   ebx: f7b3a600   ecx: 0239e000  
edx: 00000000
Oct 13 21:17:12 jik2 kernel: esi: f7b3a600   edi: 00000000   ebp: f5277e20  
esp: f5277de0
Oct 13 21:17:12 jik2 kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Oct 13 21:17:12 jik2 kernel: Process ifconfig (pid: 4337, ti=f5277000
task=f525ad60 task.ti=f5277000)
Oct 13 21:17:12 jik2 kernel: Stack: c0449698 c0759568 00000000 00000001 00200246
f7b3a000 f8944c00 00001003 
Oct 13 21:17:12 jik2 kernel:        00000014 00000000 00200046 c1be2a98 00000220
00200202 f6555500 000000a0 
Oct 13 21:17:12 jik2 kernel:        f5277e40 c0462a4f f894c9e6 0000000b 01b3a290
f7b3a000 f7b3a000 00001202 
Oct 13 21:17:12 jik2 kernel: Call Trace:
Oct 13 21:17:12 jik2 kernel:  [<c0406463>] show_trace_log_lvl+0x1a/0x2f
Oct 13 21:17:12 jik2 kernel:  [<c0406513>] show_stack_log_lvl+0x9b/0xa3
Oct 13 21:17:12 jik2 kernel:  [<c04066d3>] show_registers+0x1b8/0x289
Oct 13 21:17:12 jik2 kernel:  [<c04068af>] die+0x10b/0x23e
Oct 13 21:17:12 jik2 kernel:  [<c06362e4>] do_page_fault+0x51c/0x5ed
Oct 13 21:17:12 jik2 kernel:  [<c0634a0a>] error_code+0x72/0x78
Oct 13 21:17:12 jik2 kernel:  [<c0462a4f>] request_irq+0xc1/0x112
Oct 13 21:17:12 jik2 kernel:  [<f894df3d>] rhine_open+0x3f/0x1c8 [via_rhine]
Oct 13 21:17:12 jik2 kernel:  [<c05cfd3a>] dev_open+0x31/0x6c
Oct 13 21:17:12 jik2 kernel:  [<c05cde91>] dev_change_flags+0xa3/0x156
Oct 13 21:17:12 jik2 kernel:  [<c060e765>] devinet_ioctl+0x207/0x50e
Oct 13 21:17:12 jik2 kernel:  [<c060ee13>] inet_ioctl+0x86/0xa4
Oct 13 21:17:12 jik2 kernel:  [<c05c4066>] sock_ioctl+0x1ac/0x1c9
Oct 13 21:17:12 jik2 kernel:  [<c04942fe>] do_ioctl+0x22/0x68
Oct 13 21:17:12 jik2 kernel:  [<c049458d>] vfs_ioctl+0x249/0x25c
Oct 13 21:17:12 jik2 kernel:  [<c04945e9>] sys_ioctl+0x49/0x64
Oct 13 21:17:12 jik2 kernel:  [<c040522e>] syscall_call+0x7/0xb
Oct 13 21:17:12 jik2 kernel:  =======================
Oct 13 21:17:12 jik2 kernel: Code: 03 00 00 83 e0 0f 89 45 e4 8d 86 3c 03 00 00
e8 c5 76 ce c7 e9 b6 01 00 00 8b 7d e4 8b 46 04 c1 e7 04 01 f8 83 3d 60 0a 95 f8
06 <8b> 18 7e 17 8b 55 e4 89 5c 24 08 c7 04 24 5d ec 94 f8 89 54 24 
Oct 13 21:17:12 jik2 kernel: EIP: [<f894cb6c>] rhine_interrupt+0x186/0x6b8
[via_rhine] SS:ESP 0068:f5277de0

I don't know if this has anything to do with the nolapic or is completely a
coincidence.  Also, I've had some trouble with DHCP to Comcast in the past, so
it could also be a coincidence that when I reverted to 2.6.22.5-71.fc7, DHCP
started working again.

I see that 2.6.23-6.fc8 is out, so I'm going to try updating to that and see if
the hangs persist, and if so, I'll start at the top and try all the parameters
you listed above once again and let you know how it turns out.
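
In the oops above, a frame such as rhine_interrupt+0x186/0x6b8 [via_rhine] reads as: offset 0x186 into the function rhine_interrupt, which is 0x6b8 bytes long, in the via_rhine module. A small sketch for pulling these fields out of a log line, assuming the symbol+0xOFFSET/0xSIZE [module] layout seen in this log; parse_frame and FRAME_RE are hypothetical names.

```python
import re

# symbol+0xOFFSET/0xSIZE, optionally followed by " [module]"
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frame(line):
    """Extract symbol, offset, size and module from one oops/trace line."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),  # bytes into the function
        "size": int(match.group("size"), 16),      # total function length
        "module": match.group("module"),           # None for built-in code
    }
```

Lines without a module suffix, such as the request_irq frame in the call trace, come back with module set to None.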

Comment 6 Chuck Ebbert 2007-10-15 18:18:58 UTC

See also:

https://fedoraproject.org/wiki/KernelCommonProblems

Comment 7 Jonathan Kamens 2007-10-17 03:37:21 UTC

Same results with 2.6.23-6.fc8.i686.  "nohz=off highres=off" hangs,
"nolapic_timer" hangs, "nolapic" takes forever to start udev and then can't
start my network card, and "noapic" hangs.

I will attach the additional information you requested.

Looks like it's once again back to the FC7 kernel.

A serial console is not an option.

Comment 8 Jonathan Kamens 2007-10-17 03:38:14 UTC

Created attachment 229431 [details]
contents of /proc/interrupts

Comment 9 Jonathan Kamens 2007-10-17 03:38:41 UTC

Created attachment 229441 [details]
lspci output

Comment 10 Chuck Ebbert 2007-10-17 16:54:12 UTC

This is a long shot, but could you try disabling the VIA Rhine ethernet driver
and just use the RTL8139?

Comment 11 Jonathan Kamens 2007-10-17 17:00:59 UTC

Please be more specific; remember, you're the kernel developer, not me :-).

Do you mean that there are two different drivers that will work with my NIC,
and you want me to use a different one, or that you want me to disable one of
my two NICs?  The latter I can't do, because I need one NIC for my WAN
connection and a second for my internal network.  If you mean the former, then
tell me how to do it and I'll gladly try it.

Comment 12 Pekka Pietikäinen 2007-10-18 10:44:53 UTC

I'm seeing this too, even sysrq doesn't work. Could be a different hang, but ~=
two days is what it takes so I'll put my datapoints here vs. a separate bug. It
was a lot quicker to hang with earlier 2.6.23-pre's (Often within 5 mins, it
hung in the initial gdm prompt or maybe 5 mins after login. If it survived that
it was fine for a few days), with later ones it's more solid but eventually
hangs. Previously the keyboard leds flashed, so it must have caught an oops.
Doesn't do that anymore, so I guess even a serial console might not help (not
that the box even has a serial port...). 

Hardware is a Thinkpad X31, ipw2100 wireless, e100 ethernet (not used), mobility
radeon M6 graphics. Not suspending the box ever. So no hardware commonalities
with original poster, I believe. 2.6.23-6 running now, that fails too. Trying
2.6.23.1-23 from koji next. 2.6.22 is stable for me too.

           CPU0       
  0:   18582115    XT-PIC-XT        timer
  1:      49019    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  7:          2    XT-PIC-XT        parport0
  8:          3    XT-PIC-XT        rtc
  9:      50501    XT-PIC-XT        acpi
 11:     294389    XT-PIC-XT        yenta, yenta, ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, firewire_ohci, eth1, Intel 82801DB-ICH4 Modem, Intel 82801DB-ICH4, radeon@pci:0000:01:00.0, eth0
 12:     380606    XT-PIC-XT        i8042
 14:      76743    XT-PIC-XT        libata
 15:          0    XT-PIC-XT        libata
NMI:          0 
LOC:          0 
ERR:          0
MIS:          0

00:00.0 Host bridge: Intel Corporation 82855PM Processor to I/O Controller (rev 03)
00:01.0 PCI bridge: Intel Corporation 82855PM Processor to AGP Controller (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M)
USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M)
USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M)
USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI
Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corporation 82801DBM (ICH4-M) LPC Interface Bridge
(rev 01)
00:1f.1 IDE interface: Intel Corporation 82801DBM (ICH4-M) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus
Controller (rev 01)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM
(ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 01)
00:1f.6 Modem: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97
Modem Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon Mobility M6 LY
02:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa)
02:00.1 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa)
02:00.2 FireWire (IEEE 1394): Ricoh Co Ltd R5C552 IEEE 1394 Controller (rev 02)
02:02.0 Network controller: Intel Corporation PRO/Wireless LAN 2100 3B Mini PCI
Adapter (rev 04)
02:08.0 Ethernet controller: Intel Corporation 82801DB PRO/100 VE (MOB) Ethernet
Controller (rev 81)
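
In the /proc/interrupts listing above, nearly every device on the X31 shares IRQ 11. A rough sketch for spotting shared interrupt lines automatically, assuming the layout shown in this comment (one count column per CPU, then a single controller token, then a comma-separated device list; other kernels may print extra columns); shared_irqs is a hypothetical helper name.

```python
def shared_irqs(text):
    """Map IRQ number -> device list for every IRQ used by more than one device."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():  # skip NMI/LOC/ERR/MIS summary rows
            continue
        # ncpus count columns, one controller token, then the device list
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:
            continue
        devices = [name.strip() for name in fields[-1].split(",") if name.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared
```

Feeding it the contents of /proc/interrupts returns only the lines that have more than one handler registered.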

Comment 13 Jonathan Kamens 2007-10-21 10:03:34 UTC

Still broken in 2.6.23.1-23.fc8.  nmi_watchdog=2 doesn't help.  I suppose
there's a chance that it printed a stack trace on the console, which I couldn't
see because I was on my X VT rather than my console VT.  I'll try to remember to
switch to my console VT before walking away from my computer, but I'm concerned
that even then I might not see the stack trace, because the console will
screen-save before the lock-up.  Do you know off the top of your head how I turn
off the screen-save on a text VT?

Also, the ctrl-alt-sysrq sequences are ineffective.  It's *really* locked up.

Comment 14 Pekka Pietikäinen 2007-10-21 11:35:39 UTC

setterm -blank 0 should do the trick.

Comment 15 Pekka Pietikäinen 2007-10-23 20:18:30 UTC

Unscientifically seems to hang much quicker in -30 ( <  1 day), -6 actually went
up to 5 days before I booted to a later kernel (but it hung at least once too,
so that doesn't really say much). Running with nohz=off highres=off now. 

Mine's an ICH4 chipset, on which the hrt3 patchset does a "Force enabled HPET at
base address 0xfed00000" (a comment in the patch that does this says the HPET is
not even documented to exist on my chipset and might not be safe).
Also "Local APIC disabled by BIOS -- you can enable it with "lapic"", so nolapic
won't help here, I'm sure.

Symptoms/kernel changes would easily point to the highres timer patch. Then
again, our hardwares are quite different so if forcing HPET on ICH4 is a
problem, it won't explain your hangs... Easily could be different bugs too, but
let's keep hints here in one bug until we figure this out a bit better?

Comment 16 Chuck Ebbert 2007-10-26 18:59:30 UTC

What clocksource is the system using?

See the "system clock runs too fast/slow" section of:

  https://fedoraproject.org/wiki/KernelCommonProblems

for directions on how to change the clocksource.
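
The clocksource currently in use can also be read from sysfs. A minimal sketch, assuming the kernel exposes the usual /sys/devices/system/clocksource/clocksource0 directory; read_clocksources is a made-up helper name.

```python
from pathlib import Path

# Assumed default location of the clocksource attributes in sysfs.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(sysfs_dir=SYSFS_DIR):
    """Return (current clocksource, list of available clocksources)."""
    sysfs_dir = Path(sysfs_dir)
    current = (sysfs_dir / "current_clocksource").read_text().strip()
    available = (sysfs_dir / "available_clocksource").read_text().split()
    return current, available
```

Calling read_clocksources() with no arguments reads the live system; passing another directory makes it easy to try against saved copies of the two files.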

Comment 17 Jonathan Kamens 2007-10-28 02:20:13 UTC

My kernel is using the tsc clock.

I just hung with my VT set to my console and nmi_watchdog=2.  No backtrace was
printed to the console.  Any other suggestions?

Comment 18 Chuck Ebbert 2007-10-29 19:02:54 UTC

(In reply to comment #17)
> My kernel is using the tsc clock.
> 
> I just hung with my VT set to my console and nmi_watchdog=2.  No backtrace was
> printed to the console.  Any other suggestions?
> 

Try clocksource=acpi_pm

Comment 19 Pekka Pietikäinen 2007-10-30 12:33:35 UTC

Mine was HPET (running with acpi_pm now). The nohz=off highres=off test got
interrupted since I had to suspend the box a few times and it failed to come out
of suspend once, so nothing conclusive there. No hangs during use but no long
test periods either...

Probably should borrow an identical laptop just for testing this issue :P

Comment 20 Pekka Pietikäinen 2007-10-31 20:06:13 UTC

acpi_pm hung too after a day. Resuming nohz=off highres=off test.

Comment 21 Pekka Pietikäinen 2007-11-04 11:11:05 UTC

Well, 3.5 days of uptime with nohz=off highres=off (with HPET, which is the
default on my laptop) so I guess it's "fixed". So I think the conclusion is that
my bug is different than yours (no surprise, hardware is so different)

Anything else I should try out? Just one of nohz/highres? File a separate bug
like "Thinkpad X31 hangs on F8 kernel without nohz=off highres=off" ?

Comment 22 Jonathan Kamens 2007-11-04 13:13:24 UTC

I've been up for over five days with clocksource=acpi_pm, so perhaps that
resolved the issue for me.  I'll let you know if I hang.

Comment 23 Pekka Pietikäinen 2007-11-08 16:55:02 UTC

Updating version/summary to F8, hopefully will be more noticeable to people that
are just installing F8 now (+rawhide is a moving target with 2.6.24rc, it might
even be fixed upstream there for all I know...). Someone on #fedora-devel
already needed clocksource=acpi_pm to boot at all.

Running with just highres=off now.

Comment 24 bsquare 2007-11-12 13:02:23 UTC

I upgraded from Fedora 7 to Fedora 8 with yum, following my usual steps (cf.
http://bertrandbenoit.blogspot.com/2007/09/upgrade-gnulinux-fedora.html), without
any problem.
Unfortunately, as soon as I rebooted the computer, I started getting hangs like
these too.

I've tried many of your suggestions without success, so here are the details in
case they help.
These are the configurations I tried, one after the other (one boot for each),
all without success:
 - with nohz=off highres=off kernel options,*
 - with clocksource=acpi_pm kernel option,*
 - with nolapic_timer kernel option,*
 - with my last Fedora 7 kernel,**
 - while not connected to the local network,*
 - a longer session in text mode only (checking for potentially troublesome new
services), then editing in OpenOffice Writer.***

*: each time, my usual desktop starts (with Kate, xmms, Firefox, Thunderbird,
several Konsoles), I begin to write a new mail, and after only a few seconds
(about 30 s), the keyboard stops responding, BUT the mouse pointer can still be
moved for about 5 more seconds before it freezes too.
It was then impossible to restart X with CTRL+ALT+DEL, or to switch to
another xterm.
**: this case is perhaps more interesting, because the first 30/40 seconds went
as hoped (no hang). I stressed the computer by launching a video (mplayer),
music (xmms), and web pages with video (YouTube). There was absolutely no
problem for about 5 minutes.
Then I tried once more to answer a mail (with Thunderbird), and got the same
hang (keyboard, then mouse), BUT the music (MP3) kept playing without any
problem, and it was the whole track, not just a short section looping (as
someone described before).
***: since Thunderbird seemed to be the "key", I tried a different sequence.
With exactly the same desktop started, I closed all the "graphical" windows
(Firefox, xmms, Thunderbird) and checked whether the problem might come from new
services.
In a Konsole, I was able to "work" for 30 minutes, then I took a break (of more
than 1 hour), and the system was still alive and fine.
I then launched OpenOffice Writer to try something else, and unfortunately,
after about 1/2 minutes, the same issue occurred: keyboard, then mouse.

Hope all this information helps.

Comment 25 Pekka Pietikäinen 2007-11-12 13:43:01 UTC

Just highres=off hung within a day, nohz=off has been fine for a couple of days,
test continues still.

My guess is that #24's problem is video card related. What video card (and if
nvidia/ati, what drivers)? You could try the vesa driver, or disabling
acceleration or DRI in xorg.conf. If those help, then a separate bug should be
filed (under xorg-x11-drv-XXX ?).

Comment 26 bsquare 2007-11-12 13:53:54 UTC

(In reply to comment #25)
> Just highres=off hung within a day, nohz=off has been fine for a couple of days,
> test continues still.
> 
> My guess is that #24's problem is video card related. What video card (and if
> nvidia/ati, what drivers)? You could try the vesa driver, or disabling
> acceleration or DRI in xorg.conf. If those help, then a separate bug should be
> filed (under xorg-x11-drv-XXX ?).
I have an ATI Radeon X800 Pro.
Before the upgrade, the radeon driver was used.
I have not checked what is used now (and I cannot until this evening).

Anyway, could the driver really have been changed during the upgrade without
any warning?

Comment 27 bsquare 2007-11-13 07:28:53 UTC

Unfortunately, it seems that I still have the radeon driver in use. So there was
no change of driver, and so no obvious reason for the hang.
Anyway, wouldn't it be surprising for this to come from the graphics drivers,
given all the steps I've followed?
In addition, what would be the link between the graphics driver and the
keyboard/mouse? How could we explain the fact that the mouse freezes slightly
after the keyboard and not at exactly the same time?
For information, my mouse is connected via USB.

Comment 28 Chuck Ebbert 2007-11-13 19:30:04 UTC

(In reply to comment #27)
> Unfortunately, it seems that I still have the radeon driver in use. So there was
> no change of driver, and so no obvious reason for the hang.
> Anyway, wouldn't it be surprising for this to come from the graphics drivers,
> given all the steps I've followed?
> In addition, what would be the link between the graphics driver and the
> keyboard/mouse? How could we explain the fact that the mouse freezes slightly
> after the keyboard and not at exactly the same time?
> For information, my mouse is connected via USB.

There were some radeon bugs fixed in kernel 2.6.23.1-49.

Comment 29 Pekka Pietikäinen 2007-11-14 14:38:35 UTC

Well, my guess was based on your description of more "graphical" apps causing
the hangs for you, that's why I recommended trying with the vesa driver to
confirm/disprove my hunch, since it doesn't seem to be kernel hardware timer
code related if the nohz etc. workarounds don't help you. And since -49 has
radeon bugs fixed, that's worth a try too.

Despite F7 and F8 both using radeon it's possible that there's a regression in
the F8 version of the radeon driver that occurs when some specific type of
hardware acceleration functionality happens to be used. The mouse/keyboard hang
timing is a valuable hint too. Note that the system can be in a very confused
state and the mouse cursor will still work, it's pretty much all handled at the
hardware level. Showing keypresses on the screen in X requires the system to be
in a pretty sane state.

In a related note, nohz=off has worked for > 4 days

Comment 30 bsquare 2007-11-14 19:22:58 UTC

(In reply to comment #28)
> (In reply to comment #27)
> > Unfortunately, it seems that I still have the radeon driver in use. So there was
> > no change of driver, and so no obvious reason for the hang.
> > Anyway, wouldn't it be surprising for this to come from the graphics drivers,
> > given all the steps I've followed?
> > In addition, what would be the link between the graphics driver and the
> > keyboard/mouse? How could we explain the fact that the mouse freezes slightly
> > after the keyboard and not at exactly the same time?
> > For information, my mouse is connected via USB.
> 
> There were some radeon bugs fixed in kernel 2.6.23.1-49.

Unfortunately, after my first comment I've upgraded to this one (from the
kernel-2.6.23.1-42.fc8) and have seen the same hang.

Comment 31 bsquare 2007-11-14 19:24:02 UTC

(In reply to comment #29)
> Well, my guess was based on your description of more "graphical" apps causing
> the hangs for you, that's why I recommended trying with the vesa driver to
> confirm/disprove my hunch, since it doesn't seem to be kernel hardware timer
> code related if the nohz etc. workarounds don't help you. And since -49 has
> radeon bugs fixed, that's worth a try too.

Currently, I've changed to vesa driver and have no problem for now (after 10/20
minutes of use).

> 
> Despite F7 and F8 both using radeon it's possible that there's a regression in
> the F8 version of the radeon driver that occurs when some specific type of
> hardware acceleration functionality happens to be used. The mouse/keyboard hang
> timing is a valuable hint too. Note that the system can be in a very confused
> state and the mouse cursor will still work, it's pretty much all handled at the
> hardware level. Showing keypresses on the screen in X requires the system to be
> in a pretty sane state.
Ok, thanks for information.

If the problem persists, perhaps I can temporarily use the F7 version of the
radeon driver?


> 
> In a related note, nohz=off has worked for > 4 days

Comment 32 Pekka Pietikäinen 2007-11-15 18:58:54 UTC

Worth a try, rpm -Uvh --oldpackage xorg-x11-drv-ati from fc7 and see how it
goes, with luck dependencies won't be an issue.

In any case, I'd recommend filing a bug under xorg-x11-drv-ati (quickly looking
there were a few "radeon crashes my box" bugs that might be related to your
problem, so check if there are any hints/workarounds there).  
Including lspci / Xorg.0.log / results of trying various options from man radeon
(In particular adding Option "RenderAccel" "off" or Option "DRI" "off" in
xorg.conf in the Device section) and the fc7 driver may help in getting the
problem resolved for good.

Comment 33 bsquare 2007-11-18 20:27:02 UTC

I have tried the fc7 driver.
It is better, in that I can use my computer for longer, but unfortunately one
hang still occurred.
I have no more details for now, but it shows this temporary "solution" is not
enough.

I resolved some problems I had with pulseaudio after the upgrade; do you think
it could be involved?

Comment 34 Pekka Pietikäinen 2007-11-26 09:38:45 UTC

Heh. I now have some weird clues.

2.6.23.8-62.fc8 still hangs. 2.6.24-0.39.rc3.git1.fc9 ran fine for 4 days (I had
to suspend it and it failed to come back up fully so I had to reboot, but that's
a different bug in any case). .24 doesn't have the highres timer patches (yet?),
does it? 

Another X31 ran -62 over the weekend (no load though) without any workarounds
just fine. Differences are very minor, older BIOS+aironet (not ipw2100) on the
one that doesn't need nohz=off. dmidecode diff attached.

Comment 35 Pekka Pietikäinen 2007-11-26 09:39:45 UTC

Created attachment 268671 [details]
Diff of dmidecodes for nonworking vs. working thinkpad.

Comment 36 Chuck Ebbert 2007-11-26 16:52:34 UTC

(In reply to comment #35)
> Created an attachment (id=268671) [edit]
> Diff of dmidecodes for nonworking vs. working thinkpad.
> 

Can the older BIOS be updated?

Comment 37 Pekka Pietikäinen 2007-11-27 11:27:02 UTC

Done, running exactly the same version of BIOS (3.02)/embedded controller (1.08)
software now. Oh, I lied. Working one also has e1000, broken one e100 (but not
used in either, NM maybe polls it but that's it). I'll let it run for 3-4 days
(or until it crashes), if that succeeds I'll swap harddrives and start using
this one in the same ways I've used the misbehaving one.

I have a backup of the old ACPI DSDT btw., just in case it's that sort of thing...

Comment 38 Pekka Pietikäinen 2007-11-29 16:32:30 UTC

The X31 with the updated BIOS and airo card has been up and running for 2.5
days. The hanging ipw2100 X31, with the module disabled (mv ipw2100.ko
ipw2100.no) and ath5k used instead, has been up for a day (2.6.23-based now), so
it would seem to be a hrt patch + nohz + ipw2100 thing? Testing continues.

Comment 39 Pekka Pietikäinen 2007-11-29 18:22:49 UTC

Pah. It hung even with ath5k being used & ipw2100 not getting loaded, so there
goes that explanation. What a mystery... Oh well, at least there are two
workarounds (nohz and .24), latter might start getting affected at some point I
suppose...

Comment 40 mschout 2007-12-10 15:24:15 UTC

I need nohz=off also.  I've also been seeing these hangs using the F8 kernels. 
I never had this with the F7 kernels.  I've been running with nohz=off for about
a week and that seems to have fixed the issue for me. Without nohz=off my
machine would hang within a day. I am using the nvidia module, so it's possible
that could be a factor I guess. I'd be happy to provide any lspci etc if it
would be helpful.

Comment 41 Pekka Pietikäinen 2007-12-12 09:57:47 UTC

Please do, in case it helps finding some sort of hardware pattern (almost
identical laptop up for two weeks now without workarounds :O ) 
If you can survive without the nvidia module to get at least one hang, that'd be
super, then we can rule that one out.

* Mon Dec 10 2007 Chuck Ebbert <cebbert> 2.6.23.9-87
- highres-timers: update to -hrt4 (#394981); includes hang fix
* Fri Dec 07 2007 Chuck Ebbert <cebbert> 2.6.23.9-84
- highres-timers: fix possible hang

These need to be tested by both of us, I think, btw. (koji.fedoraproject.org has
them if updates-testing doesn't already). #394981 looks different from our bug;
the -84 fix could possibly be it.

Comment 42 mschout 2007-12-13 04:05:44 UTC

Created attachment 286551 [details]
lspci output - mschout

Comment 43 mschout 2008-01-17 16:56:03 UTC

Well, I managed to hang 2.6.23.9-85.fc8 both with and without nohz=off, so this
seems even worse than -63 was for me.  I've tried to run without the nvidia
driver, but I can not get my display to configure at all.  system-config-display
fails to configure it (and makes the screen flash wildly), and X -configure
fails to do it also.  I think the reason is that I have a quite new display
(dual Dell 2208WFP's).  I even tried unplugging and just getting a single
display configured with the same results.  The failure to config the display is
a separate issue obviously, but I can't easily run without the nvidia driver
because of that at this time.

I'm running with both nohz=off and highres=off now.  If that hangs also I will
post an update, then go back to -65 or the FC7 kernel for now.

Comment 44 mschout 2008-01-21 15:44:20 UTC

I've been running for about 4 days now with "nohz=off highres=off" and no hangs
with -85.  So that stops the hangs for me.  Guessing there are some problems
with the tickless/highres patches.

Comment 45 Pekka Pietikäinen 2008-01-28 16:03:24 UTC

New datapoints, 2.6.24-based kernels started doing the same at some point.
2.6.24-0.39.rc3.git1.fc9 worked without nohz=off, 0.150.rc7 hangs like 2.6.23.
Quite a few revisions in the middle, alas :(

Additional datapoint is from 2.6.24-0.115.rc5.git5.fc9, but that one is with
vmware modules loaded (installed vmware, ran windows inside it for some time,
stopped using vmware and it hung in 15 mins).

Comment 46 Pekka Pietikäinen 2008-01-31 15:04:47 UTC

And yet more, I started bisecting the thing. 2.6.24-2 compiled without any
fedora patches up for 3 days. I'll let it run for a bit more and then whip up a
new 2.6.24-2 with only some of the fedora patches. I think I'll try with just
utrace first unless there are any educated guesses on where to start.

Comment 47 Pekka Pietikäinen 2008-02-04 19:54:21 UTC

_SMELLS_ like utrace. Vanilla 2.6.24-2 was happy for 5 days, booted a 2.6.24-2
with some of the Fedora patches applied and it hung after a day. I applied
everything until

.. 

%if 0
ApplyPatch linux-2.6-execshield.patch

(which means utrace + some ppc patches, -mtune=generic)

Next attempt: leave out utrace and apply everything else.

Comment 48 Pekka Pietikäinen 2008-02-15 08:04:39 UTC

ApplyPatch linux-2.6-utrace-tracehook.patch
ApplyPatch linux-2.6-utrace-regset.patch
#ApplyPatch linux-2.6-utrace-core.patch
#ApplyPatch linux-2.6-utrace-ptrace-compat.patch

(+ arch-specific ones below them) hangs as well. utrace commented out but
everything else applied ran fine for 2.5 days or so (should have let it run
longer though, sometimes it takes almost 3 days for the hang to happen, so I
can't be 100% sure. Still it's either first two patches of utrace or
-mtune=generic).
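
The patch-by-patch narrowing described in the last few comments amounts to a bisection over the ordered ApplyPatch list. A sketch of that idea, under the strong assumptions that a single patch is responsible and that each test verdict is reliable (which is doubtful here, since a hang can take almost 3 days to appear); first_bad_patch and the hangs callback are hypothetical, and in practice each call to hangs means building a kernel and running it for several days.

```python
def first_bad_patch(patches, hangs):
    """Return the first patch whose inclusion makes the kernel hang, or None.

    patches: patch names in the order the spec file applies them.
    hangs:   callable taking a list of patches (a prefix of `patches`) and
             returning True if a kernel built with exactly those patches hung.
    Assumes the unpatched (vanilla) kernel does not hang.
    """
    if not patches or not hangs(patches):
        return None  # nothing to blame: the fully patched kernel is fine
    good, bad = 0, len(patches)  # patches[:good] is fine, patches[:bad] hangs
    while bad - good > 1:
        middle = (good + bad) // 2
        if hangs(patches[:middle]):
            bad = middle
        else:
            good = middle
    return patches[bad - 1]
```

With n patches this needs roughly log2(n) test kernels after the initial confirmation, but a single wrong verdict (for example, stopping a run before a slow hang shows up) sends the search to the wrong patch.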

Comment 49 Pekka Pietikäinen 2008-02-17 16:22:13 UTC

linux-2.6-utrace-tracehook.patch was enough to trigger it. I put some RPM's 
of 2.6.24.2-4 with utrace commented out in 
http://www.ee.oulu.fi/~pp/bz283161/ so other people can test whether it's the
same thing for them. Looks like the utrace included in the .24 kernels isn't the
latest one from Roland, so there might be relevant fixes there too. But let's
figure out whether we really can blame utrace first ;)

I also pinged Roland.

Comment 50 Pekka Pietikäinen 2008-02-18 21:21:54 UTC

Scratch the utrace theory, even the noutrace kernel hung :(
*continues twiddling*

Comment 51 Pekka Pietikäinen 2008-02-24 15:52:25 UTC

Well, 2.6.24.2-4 (sans utrace) actually hung even with nohz=off. Me thinks
ApplyPatch linux-2.6-highres-timers.patch was that one (That's something
the f8 kernel branch has and the f9 one doesn't) and utrace is still
a suspect.

2.6.25-0.40.rc1.git2.fc9 is actually solid without any extra options (up
for almost 5 days now). That one doesn't have a special highres patch and utrace
is commented out in the standard specfile.

Comment 52 Dave Jones 2008-02-24 17:43:26 UTC

Odd, none of the 2.6.24 kernels should have had a highres-timers patch.
I just checked cvs, and kernel-2.6.24.2-11.fc8 doesn't have one.

Comment 53 Pekka Pietikäinen 2008-02-24 21:49:30 UTC

Hmm, indeed, I accidentally diffed the .25 specfile against a 2.6.23.15-137
specfile when looking for clues, not 24.2-4 (which was the kernel that hung in
any case), so highres wasn't there. So nothing "obvious" between the vanilla .24
that worked and the slightly newer .24 that didn't... What a puzzle indeed :P At
least rawhide now works a-ok (any bets on how long?)! ;)

Comment 54 Pekka Pietikäinen 2008-03-30 18:22:28 UTC

Problem resurfaced between 
kernel-2.6.25-0.136.rc6.git5.fc9.i686 (everything including this in the 2.6.25
series has been a-ok) and kernel-2.6.25-0.163.rc7.git1.fc9.i686 (~= 2 days and hang)

utrace got reenabled in 0.139.

Comment 55 Pekka Pietikäinen 2008-08-11 10:56:41 UTC

Updated the version to 9 since it did show up on F9 kernels too. The laptop that had the problems kicked the bucket a week ago so I can no longer reproduce it. Before it died nohz=off remained a reliable workaround for 20+ day uptimes, so it having been a hardware problem is unlikely (but possible).

I'm fine with WORKSFORME. Someone on the Cc list, please add a comment if you still need command-line workarounds with the latest F9 kernels; otherwise it is probably best to close this and start new bugs as necessary.
