Bug 283161
Summary: | F8/F9 kernel hangs after a day or two of uptime
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Jonathan Kamens <jik>
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint>
Status: | CLOSED WORKSFORME | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | medium | Docs Contact: |
Priority: | medium
Version: | 9 | CC: | moneta.mace, mschout, petrosyan, pp
Target Milestone: | ---
Target Release: | ---
Hardware: | i686
OS: | Linux
Whiteboard: |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2008-08-12 22:40:36 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Attachments: |
Description
Jonathan Kamens
2007-09-07 20:38:10 UTC
Still a problem in kernel-2.6.23-0.184.rc6.git4.fc8.

Which kernel, x86_64 or i686?

i686

Can you try some kernel boot parameters? (Don't mix options from different lines.)

First try:
  nohz=off highres=off
If that doesn't help, try:
  nolapic_timer
Then:
  nolapic
And finally:
  noapic

Also, post the contents of /proc/interrupts and output of the lspci command (without any of those options in use).

With 2.6.23-0.224.rc9.git6.fc8, I tried nohz=off highres=off and nolapic_timer. Both hung. Then I tried nolapic, but this caused my computer to behave in all sorts of weird ways. In particular, it was unable to acquire a DHCP address from Comcast on boot (tried it twice), and I got a kernel oops like this:

Oct 13 21:17:12 jik2 kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
Oct 13 21:17:12 jik2 kernel: printing eip: f894cb6c *pde = 6c926067
Oct 13 21:17:12 jik2 kernel: Oops: 0000 [#1] SMP
Oct 13 21:17:12 jik2 kernel: Modules linked in: sit tunnel4 pcspkr appletalk fuse w83627hf hwmon_vid hwmon eeprom tun ipv6 nf_conntrack nfnetlink ext2 dm_mirror dm_mod floppy pwc snd_usb_audio snd_usb_lib snd_hwdep cx88_alsa snd_via82xx gameport snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss tuner snd_mixer_oss snd_pcm snd_mpu401_uart snd_rawmidi cx8800 snd_timer snd_seq_device snd_page_alloc parport_pc cx88xx snd parport 8139too ir_common soundcore via_rhine i2c_algo_bit i2c_viapro 8139cp video_buf tveeprom mii i2c_core videodev v4l1_compat compat_ioctl32 v4l2_common btcx_risc button sg sr_mod cdrom sata_via pata_via ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Oct 13 21:17:12 jik2 kernel: CPU: 0
Oct 13 21:17:12 jik2 kernel: EIP: 0060:[<f894cb6c>] Not tainted VLI
Oct 13 21:17:12 jik2 kernel: EFLAGS: 00210093 (2.6.23-0.224.rc9.git6.fc8 #1)
Oct 13 21:17:12 jik2 kernel: EIP is at rhine_interrupt+0x186/0x6b8 [via_rhine]
Oct 13 21:17:12 jik2 kernel: eax: 00000000 ebx: f7b3a600 ecx: 0239e000 edx: 00000000
Oct 13 21:17:12 jik2 kernel: esi: f7b3a600 edi: 00000000 ebp: f5277e20 esp: f5277de0
Oct 13 21:17:12 jik2 kernel: ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
Oct 13 21:17:12 jik2 kernel: Process ifconfig (pid: 4337, ti=f5277000 task=f525ad60 task.ti=f5277000)
Oct 13 21:17:12 jik2 kernel: Stack: c0449698 c0759568 00000000 00000001 00200246 f7b3a000 f8944c00 00001003
Oct 13 21:17:12 jik2 kernel: 00000014 00000000 00200046 c1be2a98 00000220 00200202 f6555500 000000a0
Oct 13 21:17:12 jik2 kernel: f5277e40 c0462a4f f894c9e6 0000000b 01b3a290 f7b3a000 f7b3a000 00001202
Oct 13 21:17:12 jik2 kernel: Call Trace:
Oct 13 21:17:12 jik2 kernel: [<c0406463>] show_trace_log_lvl+0x1a/0x2f
Oct 13 21:17:12 jik2 kernel: [<c0406513>] show_stack_log_lvl+0x9b/0xa3
Oct 13 21:17:12 jik2 kernel: [<c04066d3>] show_registers+0x1b8/0x289
Oct 13 21:17:12 jik2 kernel: [<c04068af>] die+0x10b/0x23e
Oct 13 21:17:12 jik2 kernel: [<c06362e4>] do_page_fault+0x51c/0x5ed
Oct 13 21:17:12 jik2 kernel: [<c0634a0a>] error_code+0x72/0x78
Oct 13 21:17:12 jik2 kernel: [<c0462a4f>] request_irq+0xc1/0x112
Oct 13 21:17:12 jik2 kernel: [<f894df3d>] rhine_open+0x3f/0x1c8 [via_rhine]
Oct 13 21:17:12 jik2 kernel: [<c05cfd3a>] dev_open+0x31/0x6c
Oct 13 21:17:12 jik2 kernel: [<c05cde91>] dev_change_flags+0xa3/0x156
Oct 13 21:17:12 jik2 kernel: [<c060e765>] devinet_ioctl+0x207/0x50e
Oct 13 21:17:12 jik2 kernel: [<c060ee13>] inet_ioctl+0x86/0xa4
Oct 13 21:17:12 jik2 kernel: [<c05c4066>] sock_ioctl+0x1ac/0x1c9
Oct 13 21:17:12 jik2 kernel: [<c04942fe>] do_ioctl+0x22/0x68
Oct 13 21:17:12 jik2 kernel: [<c049458d>] vfs_ioctl+0x249/0x25c
Oct 13 21:17:12 jik2 kernel: [<c04945e9>] sys_ioctl+0x49/0x64
Oct 13 21:17:12 jik2 kernel: [<c040522e>] syscall_call+0x7/0xb
Oct 13 21:17:12 jik2 kernel: =======================
Oct 13 21:17:12 jik2 kernel: Code: 03 00 00 83 e0 0f 89 45 e4 8d 86 3c 03 00 00 e8 c5 76 ce c7 e9 b6 01 00 00 8b 7d e4 8b 46 04 c1 e7 04 01 f8 83 3d 60 0a 95 f8 06 <8b> 18 7e 17 8b 55 e4 89 5c 24 08 c7 04 24 5d ec 94 f8 89 54 24
Oct 13 21:17:12 jik2 kernel: EIP: [<f894cb6c>] rhine_interrupt+0x186/0x6b8 [via_rhine] SS:ESP 0068:f5277de0

I don't know if this has anything to do with the nolapic or is completely a coincidence. Also, I've had some trouble with DHCP to Comcast in the past, so it could also be a coincidence that when I reverted to 2.6.22.5-71.fc7, DHCP started working again.

I see that 2.6.23-6.fc8 is out, so I'm going to try updating to that and see if the hangs persist, and if so, I'll start at the top and try all the parameters you listed above once again and let you know how it turns out.

Same results with 2.6.23-6.fc8.i686. "nohz=off highres=off" hangs, "nolapic_timer" hangs, "nolapic" takes forever to start udev and then can't start my network card, and "noapic" hangs. I will attach the additional information you requested. Looks like it's once again back to the FC7 kernel.

A serial console is not an option.

Created attachment 229431 [details]
contents of /proc/interrupts
Created attachment 229441 [details]
lspci output
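The /proc/interrupts output requested above is easier to compare between machines once it has been parsed. The following is only a minimal sketch, assuming the single-CPU layout (one count column) that the output posted later in this report uses; the function name `shared_irqs` and the abridged sample text are illustrative and not part of any tool mentioned in this bug.

```python
import re

# Abridged sample in the single-CPU /proc/interrupts layout; the rows are
# shortened from the Thinkpad X31 output posted later in this report.
SAMPLE = """\
           CPU0
  0:   18582115    XT-PIC-XT        timer
  9:      50501    XT-PIC-XT        acpi
 11:     294389    XT-PIC-XT        yenta, yenta, ehci_hcd:usb1, eth1, eth0
NMI:          0
"""

# irq number, one count column, interrupt controller name, handler list
LINE_RE = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\S+)\s+(.*)$")


def shared_irqs(text):
    """Return {irq: [handlers]} for IRQ lines that have more than one handler.

    Assumption: exactly one CPU column. SMP output has one count per CPU
    and would need a different pattern.
    """
    shared = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # header line, or NMI/LOC/ERR/MIS counters
        irq, _count, _chip, handlers = match.groups()
        names = [h.strip() for h in handlers.split(",") if h.strip()]
        if len(names) > 1:
            shared[int(irq)] = names
    return shared


if __name__ == "__main__":
    for irq, names in shared_irqs(SAMPLE).items():
        print(irq, names)
```

A heavily shared IRQ line is not proof of anything by itself, but listing it makes it quicker to see which drivers sit on the same interrupt when comparing reports.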
This is a long shot, but could you try disabling the VIA Rhine ethernet driver and just use the RTL8139?

Please be more specific; remember, you're the kernel developer, not me :-). Do you mean that there are two different drivers that will work with my NIC, and you want me to use a different one, or that you want me to disable one of my two NIC cards? The latter I can't do, because I need one NIC for my WAN connection and a second for my internal network. If you mean the former, then tell me how to do it and I'll gladly try it.

I'm seeing this too, even sysrq doesn't work. Could be a different hang, but ~= two days is what it takes so I'll put my datapoints here vs. a separate bug. It was a lot quicker to hang with earlier 2.6.23-pre's (often within 5 mins; it hung in the initial gdm prompt or maybe 5 mins after login. If it survived that it was fine for a few days); with later ones it's more solid but eventually hangs. Previously the keyboard LEDs flashed, so it must have caught an oops. Doesn't do that anymore, so I guess even a serial console might not help (not that the box even has a serial port...).

Hardware is a Thinkpad X31, ipw2100 wireless, e100 ethernet (not used), mobility radeon M6 graphics. Not suspending the box ever. So no hardware commonalities with original poster, I believe.

2.6.23-6 running now, that fails too. Trying 2.6.23.1-23 from koji next. 2.6.22 is stable for me too.

           CPU0
  0:   18582115    XT-PIC-XT        timer
  1:      49019    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  7:          2    XT-PIC-XT        parport0
  8:          3    XT-PIC-XT        rtc
  9:      50501    XT-PIC-XT        acpi
 11:     294389    XT-PIC-XT        yenta, yenta, ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, firewire_ohci, eth1, Intel 82801DB-ICH4 Modem, Intel 82801DB-ICH4, radeon@pci:0000:01:00.0, eth0
 12:     380606    XT-PIC-XT        i8042
 14:      76743    XT-PIC-XT        libata
 15:          0    XT-PIC-XT        libata
NMI:          0
LOC:          0
ERR:          0
MIS:          0

00:00.0 Host bridge: Intel Corporation 82855PM Processor to I/O Controller (rev 03)
00:01.0 PCI bridge: Intel Corporation 82855PM Processor to AGP Controller (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corporation 82801DBM (ICH4-M) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801DBM (ICH4-M) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 01)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 01)
00:1f.6 Modem: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Modem Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon Mobility M6 LY
02:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa)
02:00.1 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev aa)
02:00.2 FireWire (IEEE 1394): Ricoh Co Ltd R5C552 IEEE 1394 Controller (rev 02)
02:02.0 Network controller: Intel Corporation PRO/Wireless LAN 2100 3B Mini PCI Adapter (rev 04)
02:08.0 Ethernet controller: Intel Corporation 82801DB PRO/100 VE (MOB) Ethernet Controller (rev 81)

Still broken in 2.6.23.1-23.fc8.

nmi_watchdog=2 doesn't help. I suppose there's a chance that it printed a stack trace on the console, which I couldn't see because I was on my X VT rather than my console VT. I'll try to remember to switch to my console VT before walking away from my computer, but I'm concerned that even then I might not see the stack trace, because the console will screen-save before the lock-up. Do you know off the top of your head how I turn off the screen-save on a text VT? Also, the ctrl-alt-sysrq sequences are ineffective. It's *really* locked up.

setterm -blank 0 should do the trick.

Unscientifically seems to hang much quicker in -30 ( < 1 day), -6 actually went up to 5 days before I booted to a later kernel (but it hung at least once too, so that doesn't really say much). Running with nohz=off highres=off now. Mine's an ICH4 chipset, which the hrt3 patchset does a "Force enabled HPET at base address 0xfed00000" to (with a comment in the patch doing that saying that the HPET is not documented to even exist on my chipset etc. and might not be safe). Also "Local APIC disabled by BIOS -- you can enable it with "lapic"" so nolapic won't help here, I'm sure. Symptoms/kernel changes would easily point to the highres timer patch. Then again, our hardwares are quite different so if forcing HPET on ICH4 is a problem, it won't explain your hangs... Easily could be different bugs too, but let's keep hints here in one bug until we figure this out a bit better?

What clocksource is the system using? See the "system clock runs too fast/slow" section of https://fedoraproject.org/wiki/KernelCommonProblems for directions on how to change the clocksource.

My kernel is using the tsc clock.

I just hung with my VT set to my console and nmi_watchdog=2. No backtrace was printed to the console. Any other suggestions?

(In reply to comment #17)
> My kernel is using the tsc clock.
>
> I just hung with my VT set to my console and nmi_watchdog=2. No backtrace was
> printed to the console. Any other suggestions?
>

Try clocksource=acpi_pm

Mine was HPET (running with acpi_pm now). nohz=off highres=off test got interrupted since I had to suspend the box a few times and it failed to come out of suspend once, so nothing conclusive there. No hangs during use but no long test periods either... Probably should borrow an identical laptop just for testing this issue :P

acpi_pm hung too after a day. Resuming nohz=off highres=off test.

Well, 3.5 days of uptime with nohz=off highres=off (with HPET, which is the default on my laptop) so I guess it's "fixed". So I think the conclusion is that my bug is different than yours (no surprise, hardware is so different). Anything else I should try out? Just one of nohz/highres? File a separate bug like "Thinkpad X31 hangs on F8 kernel without nohz=off highres=off"?

I've been up for over five days with clocksource=acpi_pm, so perhaps that resolved the issue for me. I'll let you know if I hang.

Updating version/summary to F8, hopefully will be more noticeable to people that are just installing F8 now (+rawhide is a moving target with 2.6.24rc, it might even be fixed upstream there for all I know...). Someone on #fedora-devel already needed clocksource=acpi_pm to boot at all.

Running with just highres=off now.

I've performed an upgrade from Fedora 7 to Fedora 8 with yum using my usual steps (cf. http://bertrandbenoit.blogspot.com/2007/09/upgrade-gnulinux-fedora.html) without any problem. Unfortunately, as soon as I rebooted the computer, I had such hangs too. I've tried lots of your proposals without success, so I'll give the details in case they help.

These are the situations I've tried, one after the other (one boot for each), without success:
- with nohz=off highres=off kernel options,*
- with clocksource=acpi_pm kernel option,*
- with nolapic_timer kernel option,*
- with my last Fedora 7 kernel,**
- while not being connected to local network,*
- a longer time in text mode only (checking for potentially troublesome new services), then editing in OpenOffice Writer.***

*: each time, my usual desktop starts (with Kate, xmms, Firefox, Thunderbird, several Konsoles), I begin to write a new mail, and after only a few seconds (about 30 s), the keyboard does not respond anymore BUT the mouse pointer can still be moved for about 5 seconds before freezing too. It was then impossible to restart X with CTRL+ALT+SUPPR, or to switch to another xterm.

**: this case is perhaps more interesting, because the first 30/40 seconds began as wished (without hang). I stressed my computer, launching a video (mplayer), music (xmms), and web pages with video (YouTube). There was absolutely no problem for about 5 minutes. Then I tried once more to answer a mail (with Thunderbird), and got the same hang (keyboard, then mouse) BUT the music (MP3) kept on playing without any problem, and the whole music, not only a little part in loop (like someone explained before).

***: so, because Thunderbird seemed to be the "key", I tried other steps. With exactly the same desktop started, I closed all "graphical" windows (Firefox, xmms, Thunderbird), and checked whether the problem could come from new services. In a Konsole, I was able to "work" for 30 minutes, then I took a break (for more than 1 hour), and the system was still alive, fine. I then launched OpenOffice Writer to try something else, and unfortunately, after about 1 or 2 minutes, same issue: keyboard, then mouse.

Hope all this information will help.
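Since several commenters report results for different combinations of boot options, it helps to record exactly which workaround parameters and which clocksource were active for a given test run. The sketch below is one hedged way to do that in Python. The helper names (`parse_cmdline`, `active_workarounds`, `read_text`) are made up for illustration, the parser assumes whitespace-separated tokens without quoted values, and the sysfs clocksource path is the standard one on current kernels but may be absent on some systems.

```python
from pathlib import Path

# Workaround parameters that commenters in this report tried on the kernel
# command line.
WORKAROUNDS = ("nohz", "highres", "clocksource",
               "nolapic_timer", "nolapic", "noapic")

# Standard sysfs location of the clocksource in use (assumed to exist).
CLOCKSOURCE_PATH = (
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Assumption: tokens are whitespace separated and values are not quoted.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_workarounds(cmdline):
    """Return only the parameters from WORKAROUNDS that appear in cmdline."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WORKAROUNDS if name in params}


def read_text(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text("/proc/cmdline") or ""
    print("workarounds:", active_workarounds(cmdline))
    print("clocksource:", read_text(CLOCKSOURCE_PATH))
```

Pasting this output alongside the kernel version in each comment would make it easier to tell apart runs such as "nohz=off only" and "nohz=off highres=off".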
Just highres=off hung within a day, nohz=off has been fine for a couple of days, test continues still.

My guess for #24's problem is video card related. What video card (and if nvidia/ati, what drivers)? Could try with the vesa driver or disabling acceleration on DRI in xorg.conf. If those help, then a separate bug should be filed (under xorg-x11-drv-XXX ?).

(In reply to comment #25)
> Just highres=off hung within a day, nohz=off has been fine for a couple of days,
> test continues still.
>
> My guess for #24's problem is video card related. What video card (and if
> nvidia/ati, what drivers)? Could try with the vesa driver or disabling
> acceleration on DRI in xorg.conf. If those help, then a separate bug should be
> filed (under xorg-x11-drv-XXX ?).

I've an ATI Radeon X800 Pro. Before upgrade, the radeon driver was used. I've not checked what is now used (and I cannot until this evening). Anyway, is there no chance the driver has been changed during upgrade without any warning?

Unfortunately, it seems that I still have the radeon driver in use. So there was no change, and so no reason for hang.
Anyway, would it not be amazing if this came from graphical drivers, given all the steps I've followed?
In addition, what would be the link between graphical driver and keyboard/mouse? How could we explain the fact the mouse freezes slightly after the keyboard and not at exactly the same time?
For information, my mouse is USB connected.

(In reply to comment #27)
> Unfortunately, it seems that I still have the radeon driver in use. So there was
> no change, and so no reason for hang.
> Anyway, would it not be amazing if this came from graphical drivers, given all
> the steps I've followed?
> In addition, what would be the link between graphical driver and keyboard/mouse?
> How could we explain the fact the mouse freezes slightly after the keyboard and
> not at exactly the same time?
> For information, my mouse is USB connected.
There were some radeon bugs fixed in kernel 2.6.23.1-49.

Well, my guess was based on your description of more "graphical" apps causing the hangs for you, that's why I recommended trying with the vesa driver to confirm/disprove my hunch, since it doesn't seem to be kernel hardware timer code related if the nohz etc. workarounds don't help you. And since -49 has radeon bugs fixed, that's worth a try too.

Despite F7 and F8 both using radeon it's possible that there's a regression in the F8 version of the radeon driver that occurs when some specific type of hardware acceleration functionality happens to be used. The mouse/keyboard hang timing is a valuable hint too. Note that the system can be in a very confused state and the mouse cursor will still work, it's pretty much all handled at the hardware level. Showing keypresses on the screen in X requires the system to be in a pretty sane state.

In a related note, nohz=off has worked for > 4 days

(In reply to comment #28)
> (In reply to comment #27)
> > Unfortunately, it seems that I still have the radeon driver in use. So there was
> > no change, and so no reason for hang.
> > Anyway, would it not be amazing if this came from graphical drivers, given all
> > the steps I've followed?
> > In addition, what would be the link between graphical driver and keyboard/mouse?
> > How could we explain the fact the mouse freezes slightly after the keyboard and
> > not at exactly the same time?
> > For information, my mouse is USB connected.
>
> There were some radeon bugs fixed in kernel 2.6.23.1-49.

Unfortunately, after my first comment I upgraded to this one (from kernel-2.6.23.1-42.fc8) and have seen the same hang.

(In reply to comment #29)
> Well, my guess was based on your description of more "graphical" apps causing
> the hangs for you, that's why I recommended trying with the vesa driver to
> confirm/disprove my hunch, since it doesn't seem to be kernel hardware timer
> code related if the nohz etc. workarounds don't help you. And since -49 has
> radeon bugs fixed, that's worth a try too.

Currently, I've changed to the vesa driver and have no problem for now (after 10/20 minutes of use).

>
> Despite F7 and F8 both using radeon it's possible that there's a regression in
> the F8 version of the radeon driver that occurs when some specific type of
> hardware acceleration functionality happens to be used. The mouse/keyboard hang
> timing is a valuable hint too. Note that the system can be in a very confused
> state and the mouse cursor will still work, it's pretty much all handled at the
> hardware level. Showing keypresses on the screen in X requires the system to be
> in a pretty sane state.

Ok, thanks for the information. If it keeps on, perhaps I can temporarily use an F7 radeon driver version?

>
> In a related note, nohz=off has worked for > 4 days

Worth a try, rpm -Uvh --oldpackage xorg-x11-drv-ati from fc7 and see how it goes, with luck dependencies won't be an issue. In any case, I'd recommend filing a bug under xorg-x11-drv-ati (quickly looking there were a few "radeon crashes my box" bugs that might be related to your problem, so check if there are any hints/workarounds there). Including lspci / Xorg.0.log / results of trying various options from man radeon (in particular adding Option "RenderAccel" "off" or Option "DRI" "off" in xorg.conf in the Device section) and the fc7 driver may help in getting the problem resolved for good.

I have tried with the fc7 driver. It is better because I can use my computer for a longer time, but unfortunately one hang occurred. I have no more details for now, but it shows this temporary "solution" is not enough. I have resolved some problems I had with pulseaudio after upgrade, do you think it could be implicated?

Heh. I now have some weird clues. 2.6.23.8-62.fc8 still hangs.
2.6.24-0.39.rc3.git1.fc9 ran fine for 4 days (I had to suspend it and it failed to come back up fully so I had to reboot, but that's a different bug in any case). .24 doesn't have the highres timer patches (yet?), does it?

Another X31 ran -62 over the weekend (no load though) without any workarounds just fine. Differences are very minor, older BIOS+aironet (not ipw2100) on the one that doesn't need nohz=off. dmidecode diff attached.

Created attachment 268671 [details]
Diff of dmidecodes for nonworking vs. working thinkpad.
(In reply to comment #35)
> Created an attachment (id=268671) [edit]
> Diff of dmidecodes for nonworking vs. working thinkpad.
>

Can the older BIOS be updated?

Done, running exactly the same version of BIOS (3.02)/embedded controller (1.08) software now. Oh, I lied. Working one also has e1000, broken one e100 (but not used in either, NM maybe polls it but that's it). I'll let it run for 3-4 days (or until it crashes), if that succeeds I'll swap harddrives and start using this one in the same ways I've used the misbehaving one. I have a backup of the old ACPI DSDT btw., just in case it's that sort of thing...

Updated BIOS+airo X31 up & running for 2.5 days, hanging ipw2100 X31 with mv ipw2100.ko ipw2100.no and ath5k used instead up for a day (2.6.23-based now), so would seem like a hrt patch + nohz + ipw2100 thing? Testing continues.

Pah. It hung even with ath5k being used & ipw2100 not getting loaded, so there goes that explanation. What a mystery... Oh well, at least there are two workarounds (nohz and .24), latter might start getting affected at some point I suppose...

I need nohz=off also.

I've also been seeing these hangs using the F8 kernels. I never had this with the F7 kernels. I've been running with nohz=off for about a week and that seems to have fixed the issue for me. Without nohz=off my machine would hang within a day. I am using the nvidia module, so it's possible that could be a factor I guess. I'd be happy to provide any lspci etc. if it would be helpful.

Please do, in case it helps finding some sort of hardware pattern (almost identical laptop up for two weeks now without workarounds :O ). If you can survive without the nvidia module to get at least one hang, that'd be super, then we can rule that one out.

* Mon Dec 10 2007 Chuck Ebbert <cebbert> 2.6.23.9-87
- highres-timers: update to -hrt4 (#394981); includes hang fix

* Fri Dec 07 2007 Chuck Ebbert <cebbert> 2.6.23.9-84
- highres-timers: fix possible hang

needs to be tested by both of us, I think, btw. (koji.fedoraproject.org has them if updates-testing doesn't already). #394981 looks different than our bug, the -84 thing maybe could possibly be it.

Created attachment 286551 [details]
lspci output - mschout
Well, I managed to hang 2.6.23.9-85.fc8 both with and without nohz=off, so this seems even worse than -63 was for me.

I've tried to run without the nvidia driver, but I can not get my display to configure at all. system-config-display fails to configure it (and makes the screen flash wildly), and X -configure fails to do it also. I think the reason is that I have a quite new display (dual Dell 2208WFP's). I even tried unplugging and just getting a single display configured with the same results. The failure to config the display is a separate issue obviously, but I can't easily run without the nvidia driver because of that at this time.

I'm running with both nohz=off and highres=off now. If that hangs also I will post an update, then go back to -65 or the FC7 kernel for now.

I've been running for about 4 days now with "nohz=off highres=off" and no hangs with -85. So that stops the hangs for me. Guessing there are some problems with the tickless/highres patches.

New datapoints, 2.6.24-based kernels started doing the same at some point. 2.6.24-0.39.rc3.git1.fc9 worked without nohz=off, 0.150.rc7 hangs like 2.6.23. Quite a few revisions in the middle, alas :( Additional datapoint is from 2.6.24-0.115.rc5.git5.fc9, but that one is with vmware modules loaded (installed vmware, ran windows inside it for some time, stopped using vmware and it hung in 15 mins).

And yet more, I started bisecting the thing. 2.6.24-2 compiled without any fedora patches up for 3 days. I'll let it run for a bit more and then whip up a new 2.6.24-2 with only some of the fedora patches. I think I'll try with just utrace first unless there are any educated guesses on where to start.

_SMELLS_ like utrace. Vanilla 2.6.24-2 was happy for 5 days, booted a 2.6.24-2 with some of the Fedora patches applied and it hung after a day. I applied everything until ..

%if 0
ApplyPatch linux-2.6-execshield.patch

(which means utrace + some ppc patches, -mtune=generic)

Next attempt: leave out utrace and apply everything else.

ApplyPatch linux-2.6-utrace-tracehook.patch
ApplyPatch linux-2.6-utrace-regset.patch
#ApplyPatch linux-2.6-utrace-core.patch
#ApplyPatch linux-2.6-utrace-ptrace-compat.patch
(+ arch-specific ones below them)

hangs as well. utrace commented out but everything else applied ran fine for 2.5 days or so (should have let it run longer though, sometimes it takes almost 3 days for the hang to happen, so I can't be 100% sure. Still it's either first two patches of utrace or -mtune=generic).

linux-2.6-utrace-tracehook.patch was enough to trigger it. I put some RPM's of 2.6.24.2-4 with utrace commented out in http://www.ee.oulu.fi/~pp/bz283161/ so other people can test whether it's the same thing for them. Looks like the utrace included in the .24 kernels isn't the latest one from Roland, so there might be relevant fixes there too. But let's figure out whether we really can blame utrace first ;) I also pinged Roland.

Scratch the utrace theory, even the noutrace kernel hung :( *continues twiddling*

Well, 2.6.24.2-4 (sans utrace) actually hung even with nohz=off. Me thinks ApplyPatch linux-2.6-highres-timers.patch was that one (that's something the f8 kernel branch has and the f9 one doesn't) and utrace is still a suspect. 2.6.25-0.40.rc1.git2.fc9 is actually solid without any extra options (up for almost 5 days now). That one doesn't have a special highres patch and utrace is commented out in the standard specfile.

Odd, none of the 2.6.24 kernels should have had a highres-timers patch. I just checked cvs, and kernel-2.6.24.2-11.fc8 doesn't have one.

Hmn, indeed, I diffed the .25 against a 2.6.23.15-137 specfile by accident when looking for clues, not 24.2-4 (which was the kernel that hangs in any case) so highres wasn't there. So nothing "obvious" between the vanilla .24 that worked and the slightly newer .24 that didn't...
What a puzzle indeed :P At least rawhide now works a-ok (any bets on how long?)! ;)

Problem resurfaced between kernel-2.6.25-0.136.rc6.git5.fc9.i686 (everything including this in the 2.6.25 series has been a-ok) and kernel-2.6.25-0.163.rc7.git1.fc9.i686 (~= 2 days and hang). utrace got reenabled in 0.139.

Updated to 9 since it did show up on F9 kernels too.

The laptop that had the problems kicked the bucket a week ago so I can no longer reproduce it. Before it died nohz=off remained a reliable workaround for 20 day+ uptimes, so it being a hardware thing before is unlikely (but possible).

I'm fine with WORKSFORME. Someone on the Cc list please add a comment if you still need command line things with latest F9 kernels, otherwise probably best to close and start new bugs as necessary?