Bug 299031

Summary: [cpuidle] kernel freeze
Product: [Fedora] Fedora
Reporter: Adam Serbinski <serbinski>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: 8
CC: misek, mishu, stephen.kent.phillips, tglx, v.p.andronache
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2007-10-17 19:15:58 UTC
Bug Blocks: 184121    
Attachments:
Description                                       Flags
dmesg from kernel 0.211 with noapictimer option   none
AMD c1e detection fix                             none
nvidia timer override patch                       none

Description Adam Serbinski 2007-09-20 19:19:47 UTC
Description of problem:
Kernel freezes

Version-Release number of selected component (if applicable):
Any kernel version released up to the time this bug was filed.

How reproducible:
Always

Steps to Reproduce:
1. Install F8-Test2
2. Turn on.
  
Actual results:
The system freezes up

Expected results:
The system does not freeze

Additional info:
Freezes at the point where it displays the following:
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
--this suggestion doesn't help since it is not related.

Does not pass this point until the POWER button is tapped. It then continues,
but will again freeze after "starting udev". Pressing keys on keyboard will
prompt it to continue until X starts. Again, continuously freezes unless keys
are pressed or mouse is moved.

System:
Acer Aspire 9300
CPU: AMD Turion X2-TL50

Workaround:
Add kernel parameter "idle=poll". Should I really have to poll my CPU on a laptop??

More info:
Kernel parameter "nohz=off" does nothing for this problem, so it may not be
related to dynamic ticks.

Comment 1 Chuck Ebbert 2007-09-20 19:55:07 UTC
"idle=poll" will make the CPU run too hot. And "idle=halt" either doesn't work
or was removed as an option, apparently.
Can you try a vanilla kernel to see if this is caused by Fedora-specific patches
(probably cpuidle)?

http://people.redhat.com/cebbert/kernels/F8/x86_64/kernel-vanilla-2.6.23-0.185.rc6.git7.V.x86_64.rpm



Comment 2 Adam Serbinski 2007-09-20 20:35:37 UTC
Yes, that kernel appears to be working.

Comment 3 Chuck Ebbert 2007-09-21 23:43:21 UTC
Vanilla kernel works, our kernel hangs unless we have idle=poll. And our cpuidle
patch is mixed in with the highres-timers patch. What we have in there doesn't
match what's in the ACPI git tree for cpuidle, either.

Comment 4 Thomas Gleixner 2007-09-22 05:39:37 UTC
I'm hunting a regression. Will publish a new -hrt queue with a bunch of fixes
later today.


Comment 5 Chuck Ebbert 2007-09-25 19:26:33 UTC
0.202.rc8 still hangs on boot without idle=poll.

Comment 6 Thomas Gleixner 2007-09-25 19:50:10 UTC
Just sent out a patch to LKML which addresses an AMD X2 problem.

http://lkml.org/lkml/2007/9/25/343


Comment 7 Chuck Ebbert 2007-09-25 19:51:47 UTC
Kernel option "noapictimer" works.
Apparently this is the same as "nolapic_timer" on i386??

Comment 8 Thomas Gleixner 2007-09-25 19:59:29 UTC
yes


Comment 9 Chuck Ebbert 2007-09-26 22:26:39 UTC
Mine still hangs on boot with the c1e patch applied and kernel options:
  nolapic nohz=off highres=off
(pressing power still makes it continue)

Freezes randomly without 'nolapic' (solid lockup.)
Does not hang at boot with:
  nolapic nohz=off highres=off noapictimer

Clock interrupts seem to be sent all over the place, even to IRQ7 (until
disabled as spurious)

           CPU0       CPU1
  0:     392541      11375    XT-PIC-XT        timer
  1:          0        233   IO-APIC-edge      i8042
  2:          0          0    XT-PIC-XT        cascade
  5:          8       5582   IO-APIC-edge      sata_nv
  7:       9403      90597   IO-APIC-edge      ehci_hcd:usb1
  8:          0          1   IO-APIC-edge      rtc
  9:          0        348   IO-APIC-edge      acpi
 10:          1        177   IO-APIC-edge      HDA Intel
 11:         23      38555   IO-APIC-edge      ohci_hcd:usb2, eth0
 12:          0        879   IO-APIC-edge      i8042
 14:          4       1491   IO-APIC-edge      libata
 15:          0          0   IO-APIC-edge      libata
NMI:          0          0
LOC:      11367     392248
ERR:          0

$ cat /proc/cmdline
ro root=LABEL=/ nohz=off highres=off acpi_irq_nobalance nolapic noapictimer
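
As a rough illustration of how the counters above can be compared per CPU, the following sketch parses /proc/interrupts-style text and sums the IRQ 0 (timer) and LOC (local APIC timer) rows. The helper name parse_interrupts and the trimmed sample are assumptions made for this example; the numbers are copied from the table above. In this sample the timer and LOC counts look roughly mirrored between the two CPUs.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {row label: [per-CPU counts]}.

    Hypothetical helper for illustration; not taken from any kernel tool.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Rows such as ERR carry fewer counters than there are CPUs,
        # and trailing fields are controller and device names.
        counts[label.strip()] = [int(f) for f in fields[:ncpus] if f.isdigit()]
    return counts


# Trimmed copy of the table from this comment.
SAMPLE = """\
           CPU0       CPU1
  0:     392541      11375    XT-PIC-XT        timer
  7:       9403      90597   IO-APIC-edge      ehci_hcd:usb1
LOC:      11367     392248
ERR:          0
"""

counts = parse_interrupts(SAMPLE)
timer, loc = counts["0"], counts["LOC"]
for cpu in range(len(timer)):
    total = timer[cpu] + loc[cpu]
    print(f"CPU{cpu}: timer={timer[cpu]} LOC={loc[cpu]} total={total}")
```

On a live system the same function could be fed the contents of /proc/interrupts, sampled twice, to see which counters are still incrementing.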



Comment 10 Thomas Gleixner 2007-09-26 22:41:04 UTC
Can you provide a boot log of mainline and the above kernel please ?


Comment 11 Chuck Ebbert 2007-09-26 23:27:22 UTC
(In reply to comment #10)
> Can you provide a boot log of mainline and the above kernel please ?
> 

Building kernel-vanilla now. Any specific options to boot with?



Comment 12 Thomas Gleixner 2007-09-26 23:39:04 UTC
Standard FC8 config is fine. I'll look into it tomorrow.



Comment 13 Chuck Ebbert 2007-09-27 20:47:09 UTC
Created attachment 208991
dmesg from kernel 0.211 with noapictimer option

Kernel 0.211 has the fixed AMD c1e patch applied (disable_apic_timer is
__cpuinitdata, not __initdata). Even with noapictimer, the local apic interrupt
count on CPU 0 is incrementing.

Comment 14 Chuck Ebbert 2007-10-03 15:09:13 UTC
*** Bug 237325 has been marked as a duplicate of this bug. ***

Comment 15 Chuck Ebbert 2007-10-03 15:11:54 UTC
When booting on battery, system stalls unless I keep pressing keys on the
keyboard. This does not happen when plugged in. Kernel is 0.214, from F8T3.

Boot options: noapic noirqdebug

Comment 16 Chuck Ebbert 2007-10-03 15:51:32 UTC
*** Bug 316811 has been marked as a duplicate of this bug. ***

Comment 17 Adam Serbinski 2007-10-08 17:20:26 UTC
This problem appears to be solved in 0.222.

Comment 18 Chuck Ebbert 2007-10-10 22:45:54 UTC
Not fixed in kernel 2.6.23-4 here (with new highres-timers patch.) Still using
"noapic noirqdebug" -- kernel hangs and cooling fan spins up until power switch
is pressed, then bootup continues normally. Haven't tried booting on battery yet...


Comment 19 Vaclav "sHINOBI" Misek 2007-10-11 08:18:38 UTC
My notebook works with "noapictimer" on the latest rawhide kernel (0.224)
without problems. Without it, it hangs whether or not it is started on battery.
After pressing the power button the bootup continues, but later, when some
services are started, I need to press some keys (like Shift) or move the mouse
(after gpm is started) to stop it hanging.

Comment 20 Thomas Gleixner 2007-10-13 15:13:56 UTC
Chuck, is noapictimer working for you as well?

Comment #7 https://bugzilla.redhat.com/show_bug.cgi?id=299031#c7 says it worked
for you before. If it works, I can provide some debug patch, which allows us to
decode that problem better.


Comment 21 Thomas Gleixner 2007-10-13 18:46:43 UTC
OK, I stared at the code long enough and found the reason why the AMD C1E
detection works on 32-bit but not on 64-bit. It's not trivial to fix, but I
have a solution in mind already. Patch will follow ASAP.

Comment 22 Thomas Gleixner 2007-10-14 21:06:55 UTC
Created attachment 226751
AMD c1e detection fix

Chuck, does this fix your problem ?

Comment 23 Chuck Ebbert 2007-10-15 17:26:45 UTC
(In reply to comment #22)
> Created an attachment (id=226751)
> AMD c1e detection fix
> 
> Chuck, does this fix your problem ?
> 

It doesn't hang at boot any more, but now the tickless mode seems to be disabled.


Comment 24 Chuck Ebbert 2007-10-15 19:58:01 UTC
Created attachment 227881
nvidia timer override patch

This patch (on top of the new C1E patch) fixes some of my problems with this
nVidia C51/MCP51 chipset x86_64 machine. No messages about c1e are printed
anymore and it hangs on boot even with those last c1e fixes, but the spurious
interrupts are all gone now. Somehow this patch is making the c1e detection
code get skipped, I guess?

Comment 25 Thomas Gleixner 2007-10-15 21:33:48 UTC
I have this patch in my pile of crap already. I'll take a look at it right now.

The tickless mode is disabled due to the C1E detection. Sorry, that's the
fallout from broken hardware. We could get away with permanent broadcasting
though, but this only makes sense when we have a working HPET in the system.
The PIT is so slow to program, and for tickless it is just useless due to the
small max. next event delta.




Comment 26 Chuck Ebbert 2007-10-16 17:25:13 UTC
(In reply to comment #25)
> The tickless mode is disabled due to the C1E detection. Sorry, that's the
> fallout from broken hardware. We could get away with permanent broadcasting
> though, but this only makes sense when we have a working HPET in the system.

Most new systems have one, I know this one does. (3 channels, 32 bit)
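
To put rough numbers on the "small max. next event delta" point, the sketch below computes the longest interval a free-running down-counter can express from its width and input clock. The PIT values (16-bit counter, about 1.193182 MHz) are standard for the i8254; the HPET frequency used here is only an assumed typical value, since the real rate is read from the hardware, and the kernel may apply tighter limits than the raw counter width. The function name max_delta_seconds is made up for this example.

```python
PIT_HZ = 1_193_182    # i8254 PIT input clock, about 1.193182 MHz
PIT_BITS = 16         # PIT counters are 16 bits wide

HPET_HZ = 14_318_180  # ASSUMPTION: a commonly seen HPET rate, not measured on this machine
HPET_BITS = 32        # the thread mentions 32-bit HPET channels


def max_delta_seconds(freq_hz, counter_bits):
    """Upper bound on a single programmed interval, from counter width alone."""
    return (2 ** counter_bits - 1) / freq_hz


pit_max = max_delta_seconds(PIT_HZ, PIT_BITS)
hpet_max = max_delta_seconds(HPET_HZ, HPET_BITS)

print(f"PIT  max delta: {pit_max * 1000:.1f} ms")
print(f"HPET max delta: {hpet_max:.1f} s")
```

Under these assumptions the PIT cannot sleep for much more than a few tens of milliseconds per event, while a 32-bit HPET channel can cover minutes, which is why broadcasting from the PIT gains little for tickless idle.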



Comment 27 Thomas Gleixner 2007-10-16 18:29:48 UTC
Yeah, I know. I hope that Venki will come up with the per-CPU HPET code soon,
so we can really utilize HPET instead of just broadcasting.


Comment 28 Chuck Ebbert 2007-10-16 21:17:57 UTC
I *think* I can see what is happening with the latest c1e detection code. We've
set up local APIC timer interrupts on CPU0 before detecting the problem and
apparently aren't tearing them down properly, causing them to be both broadcast
and fired by the hardware. If I force noapictimer on the command line with the
latest F8 code, everything works fine: CPU1 has 50000 timer interrupts and CPU0
has 50000 local apic interrupts.

So now instead of "noapic noirqdebug" I have to use "noapictimer
acpi_use_timer_override" (the nVidia quirk code from Andi apparently doesn't work)
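
Since the working set of boot options changed several times in this thread, a small sketch like the one below can help record which timer-related workarounds a given boot actually used. It splits a kernel command line on whitespace and reports the options named in this bug. The function name parse_cmdline and the WORKAROUNDS tuple are assumptions for this example, and quoted values containing spaces are not handled.

```python
# Option names are the ones mentioned in this bug report.
WORKAROUNDS = (
    "idle", "nohz", "highres", "nolapic", "noapic",
    "noapictimer", "noirqdebug", "acpi_use_timer_override",
)


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value or True}.

    Hypothetical helper; splits only on the first '=' so that
    values such as root=LABEL=/ stay intact.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


# Command line quoted earlier in this thread.
sample = "ro root=LABEL=/ nohz=off highres=off acpi_irq_nobalance nolapic noapictimer"
params = parse_cmdline(sample)
active = {name: params[name] for name in WORKAROUNDS if name in params}
print(active)
```

On a running system the input would come from reading /proc/cmdline instead of the sample string.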


Comment 29 Chuck Ebbert 2007-10-17 19:15:58 UTC
Fixed by the -hrt3 patch.