Red Hat Bugzilla – Bug 299031
[cpuidle] kernel freeze
Last modified: 2007-11-30 17:12:16 EST
Description of problem:
Version-Release number of selected component (if applicable):
Any kernel version released up to the time this bug was filed.
Steps to Reproduce:
1. Install F8-Test2
2. Turn on.
Actual results:
The system freezes up.
Expected results:
The system does not freeze.
Additional info:
Freezes at the point where it displays the following:
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
--this suggestion doesn't help since it is not related.
Does not pass this point until the POWER button is tapped. It then continues,
but will again freeze after "starting udev". Pressing keys on keyboard will
prompt it to continue until X starts. Again, continuously freezes unless keys
are pressed or mouse is moved.
Acer Aspire 9300
CPU: AMD Turion X2-TL50
Add kernel parameter "idle=poll". Should I really have to poll my CPU on a laptop??
kernel parameter "nohz=off" does nothing for this problem, so it may not be
related to dynamic ticks.
"idle=poll" will make the CPU run too hot. And "idle=halt" either doesn't work
or was removed as an option, apparently.
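Since the workarounds discussed here are all kernel command-line parameters, a small sketch for checking which ones a given boot actually used may help when comparing reports. This is only an illustration: the helper names parse_cmdline and read_cmdline are made up for this example, and the sample string is illustrative apart from idle=poll.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "ro" map to None. Only the first "=" separates
    name from value, so "root=LABEL=/" keeps "LABEL=/" as its value.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Illustrative command line using the idle=poll workaround from this report.
example = "ro root=LABEL=/ idle=poll"
params = parse_cmdline(example)
print(params.get("idle"))  # poll
```

On a live system, parse_cmdline(read_cmdline()) gives the same dict for the running kernel.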
Can you try a vanilla kernel to see if this is caused by Fedora-specific patches?
Yes, that kernel appears to be working.
Vanilla kernel works, our kernel hangs unless we have idle=poll. And our cpuidle
patch is mixed in with the highres-timers patch. What we have in there doesn't
match what's in the ACPI git tree for cpuidle, either.
I'm hunting a regression. Will publish a new -hrt queue with a bunch of fixes
0.202.rc8 still hangs on boot without idle=poll.
Just sent out a patch to LKML which addresses an AMD X2 problem.
Kernel option "noapictimer" works.
Apparently this is the same as "nolapic_timer" on i386??
Mine still hangs on boot with the c1e patch applied and kernel options:
nolapic nohz=off highres=off
(pressing power still makes it continue)
Freezes randomly without 'nolapic' (solid lockup.)
Does not hang at boot with:
nolapic nohz=off highres=off noapictimer
Clock interrupts seem to be sent all over the place, even to IRQ7 (until
disabled as spurious)
           CPU0       CPU1
  0:     392541      11375   XT-PIC-XT      timer
  1:          0        233   IO-APIC-edge   i8042
  2:          0          0   XT-PIC-XT      cascade
  5:          8       5582   IO-APIC-edge   sata_nv
  7:       9403      90597   IO-APIC-edge   ehci_hcd:usb1
  8:          0          1   IO-APIC-edge   rtc
  9:          0        348   IO-APIC-edge   acpi
 10:          1        177   IO-APIC-edge   HDA Intel
 11:         23      38555   IO-APIC-edge   ohci_hcd:usb2, eth0
 12:          0        879   IO-APIC-edge   i8042
 14:          4       1491   IO-APIC-edge   libata
 15:          0          0   IO-APIC-edge   libata
NMI:          0          0
LOC:      11367     392248
$ cat /proc/cmdline
ro root=LABEL=/ nohz=off highres=off acpi_irq_nobalance nolapic noapictimer
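To make the interrupt counts above easier to compare, here is a rough sketch that parses /proc/interrupts-style text and adds up the IRQ 0 timer and LOC counts per CPU. It assumes the two count columns are CPU0 and CPU1, as on this dual-core machine; parse_interrupts is a made-up helper name, and the sample contains only a few of the rows pasted above.

```python
def parse_interrupts(text, ncpus=2):
    """Return {label: (per_cpu_counts, description)} from /proc/interrupts text.

    Assumes `ncpus` count columns follow each "label:" prefix.
    """
    table = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # a CPU header line has no colon, so skip it
        fields = rest.split()
        counts = [int(x) for x in fields[:ncpus]]
        table[label.strip()] = (counts, " ".join(fields[ncpus:]))
    return table


# A few rows copied from the output in this comment.
SAMPLE = """\
  0: 392541 11375 XT-PIC-XT timer
  7: 9403 90597 IO-APIC-edge ehci_hcd:usb1
LOC: 11367 392248
"""

table = parse_interrupts(SAMPLE)
timer, _ = table["0"]
loc, _ = table["LOC"]
for cpu in range(2):
    # Tick sources per CPU: IRQ 0 timer plus local APIC timer.
    print(cpu, timer[cpu], loc[cpu], timer[cpu] + loc[cpu])
```

With these numbers the two sources add up to 403908 on the first CPU and 403623 on the second, so each CPU sees roughly the same number of ticks but mostly from a different source.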
Can you provide a boot log of mainline and the above kernel please?
(In reply to comment #10)
> Can you provide a boot log of mainline and the above kernel please?
Building kernel-vanilla now. Any specific options to boot with?
Standard FC8 config is fine. I'll look into it tomorrow.
Created attachment 208991 [details]
dmesg from kernel 0.211 with noapictimer option
Kernel 0.211 has the fixed AMD c1e patch applied (disable_apic_timer is
__cpuinitdata, not __initdata). Even with noapictimer, the local apic interrupt
on CPU 0 is incrementing.
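One way to confirm that the local APIC timer count is still incrementing is to diff the LOC line between two snapshots of /proc/interrupts. The sketch below is only an illustration: the helper names are invented and the snapshot numbers are hypothetical, not taken from this machine.

```python
def loc_counts(text):
    """Extract the per-CPU counts from the LOC: line of /proc/interrupts text."""
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "LOC":
            # Keep only numeric fields; newer kernels append a text description.
            return [int(x) for x in rest.split() if x.isdigit()]
    raise ValueError("no LOC line found")


def loc_delta(before, after):
    """Per-CPU increase in local APIC timer interrupts between two snapshots."""
    return [b - a for a, b in zip(loc_counts(before), loc_counts(after))]


# HYPOTHETICAL snapshots, for illustration only.
before = "LOC: 1000 2000\n"
after = "LOC: 1250 2000\n"
print(loc_delta(before, after))  # [250, 0]
```

On a live system the two snapshots would come from reading /proc/interrupts twice, a second or so apart. A nonzero delta on a CPU means its local APIC timer is still firing.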
*** Bug 237325 has been marked as a duplicate of this bug. ***
When booting on battery, system stalls unless I keep pressing keys on the
keyboard. This does not happen when plugged in. Kernel is 0.214, from F8T3.
Boot options: noapic noirqdebug
*** Bug 316811 has been marked as a duplicate of this bug. ***
This problem appears to be solved in 0.222.
Not fixed in kernel 2.6.23-4 here (with new highres-timers patch.) Still using
"noapic noirqdebug" -- kernel hangs and cooling fan spins up until power switch
is pressed, then bootup continues normally. Haven't tried booting on battery yet...
My notebook works with "noapictimer" with the latest rawhide kernel (0.224) without
problems. Without it it hangs no matter if it's started on battery or not. After
pressing power button the bootup continues, but later when some services are
started, I need to press some buttons (like shift etc. or move with the mouse
after gpm is started) to stop hanging.
Chuck, is nolapictimer working for you as well?
Comment #7 https://bugzilla.redhat.com/show_bug.cgi?id=299031#c7 says it worked
for you before. If it works, I can provide some debug patch, which allows us to
decode that problem better.
ok, I stared long enough at the code and I found the reason, why the AMD C1E
detection works on 32bit and not on 64bit. It's not trivial to fix, but I have a
solution in mind already. Patch will follow ASAP.
Created attachment 226751 [details]
AMD c1e detection fix
Chuck, does this fix your problem?
(In reply to comment #22)
> Created an attachment (id=226751) 
> AMD c1e detection fix
> Chuck, does this fix your problem?
It doesn't hang at boot any more, but now the tickless mode seems to be disabled.
Created attachment 227881 [details]
nvidia timer override patch
This patch (on top of the new C1E patch) fixes some of my problems with this
nVidia C51/MCP51 chipset x86_64 machine. No messages about c1e are printed
anymore, and it still hangs on boot even with those last c1e fixes, but the
spurious interrupts are all gone now. Somehow this patch is making the c1e
detection code get skipped, I guess?
I have this patch in my pile of crap already. I'll take a look at it right now.
The tickless mode is disabled due to the C1E detection. Sorry, that's the
fallout from broken hardware. We could get away with permanent broadcasting
though, but this only makes sense when we have a working HPET in the system.
The PIT is slow to program, and for tickless it is just useless due to its
small maximum next-event delta.
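To put a number on the PIT limitation, the longest interval a countdown timer can be programmed for is roughly its counter range divided by its input clock. The sketch below works that out for the 16-bit PIT at its standard 1.193182 MHz clock. The HPET figures are assumptions for comparison only: a 32-bit counter running at 14.31818 MHz, which is a common rate but varies by chipset.

```python
PIT_HZ = 1_193_182        # standard i8254 PIT input clock
PIT_MAX_COUNT = 1 << 16   # 16-bit counter

HPET_HZ = 14_318_180      # ASSUMPTION: common HPET rate, varies by chipset
HPET_MAX_COUNT = 1 << 32  # ASSUMPTION: 32-bit counter


def max_delta_seconds(max_count, hz):
    """Longest programmable interval for a timer with this range and clock."""
    return max_count / hz


pit = max_delta_seconds(PIT_MAX_COUNT, PIT_HZ)
hpet = max_delta_seconds(HPET_MAX_COUNT, HPET_HZ)
print(f"PIT max delta:  {pit * 1000:.1f} ms")
print(f"HPET max delta: {hpet:.0f} s")
```

This gives about 55 ms for the PIT as a hardware ceiling (the kernel may use a smaller limit), against minutes for the assumed HPET. A tickless CPU broadcasting from the PIT therefore cannot sleep much longer than a few ordinary ticks.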
(In reply to comment #25)
> The tickless mode is disabled due to the C1E detection. Sorry, that's the
> fallout from broken hardware. We could get away with permanent broadcasting
> though, but this only makes sense when we have a working HPET in the system.
Most new systems have one, I know this one does. (3 channels, 32 bit)
Yeah, I know. I hope that Venki will come up with the per-CPU HPET code soon, so
we can really utilize HPET instead of just broadcasting.
I *think* I can see what is happening with the latest c1e detection code. We've
set up local APIC timer interrupts on CPU0 before detecting the problem and
apparently aren't tearing them down properly, causing them to be both broadcast
and fired by the hardware. If I force noapictimer on the command line with the
latest F8 code, everything works fine: CPU1 has 50000 timer interrupts and CPU0
has 50000 local apic interrupts.
So now instead of "noapic noirqdebug" I have to use "noapictimer
acpi_use_timer_override" (the nVidia quirk code from Andi apparently doesn't work).
Fixed by the -hrt3 patch.