299031 – [cpuidle] kernel freeze

Summary: [cpuidle] kernel freeze

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	8
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	237325 316811
Depends On:
Blocks:	F8KernelBlocker

Reported:	2007-09-20 19:19 UTC by Adam Serbinski
Modified:	2007-11-30 22:12 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-10-17 19:15:58 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg from kernel 0.211 with noapictimer option (26.14 KB, text/plain) 2007-09-27 20:47 UTC, Chuck Ebbert, no flags
AMD c1e detection fix (5.00 KB, patch) 2007-10-14 21:06 UTC, Thomas Gleixner, no flags
nvidia timer override patch (4.38 KB, patch) 2007-10-15 19:58 UTC, Chuck Ebbert, no flags

Description Adam Serbinski 2007-09-20 19:19:47 UTC

Description of problem:
Kernel freezes

Version-Release number of selected component (if applicable):
Any kernel version released up to the time this bug was filed.

How reproducible:
Always

Steps to Reproduce:
1. Install F8-Test2
2. Turn on.

Actual results:
The system freezes up

Expected results:
The system does not freeze

Additional info:
Freezes at the point where the following is displayed:
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
--this suggestion doesn't help since it is not related.

It does not get past this point until the power button is tapped. It then continues,
but freezes again after "starting udev". Pressing keys on the keyboard prompts it
to continue until X starts. After that, it keeps freezing unless keys are pressed
or the mouse is moved.

System:
Acer Aspire 9300
CPU: AMD Turion X2-TL50

Workaround:
Add kernel parameter "idle=poll". Should I really have to poll my CPU on a laptop??

More info:
Kernel parameter "nohz=off" does nothing for this problem, so it may not be
related to dynamic ticks.

Comment 1 Chuck Ebbert 2007-09-20 19:55:07 UTC

"idle=poll" will make the CPU run too hot. And "idle=halt" either doesn't work
or was removed as an option, apparently.
Can you try a vanilla kernel to see if this is caused by Fedora-specific patches
(probably cpuidle)?

http://people.redhat.com/cebbert/kernels/F8/x86_64/kernel-vanilla-2.6.23-0.185.rc6.git7.V.x86_64.rpm

Comment 2 Adam Serbinski 2007-09-20 20:35:37 UTC

Yes, that kernel appears to be working.

Comment 3 Chuck Ebbert 2007-09-21 23:43:21 UTC

Vanilla kernel works, our kernel hangs unless we have idle=poll. And our cpuidle
patch is mixed in with the highres-timers patch. What we have in there doesn't
match what's in the ACPI git tree for cpuidle, either.

Comment 4 Thomas Gleixner 2007-09-22 05:39:37 UTC

I'm hunting a regression. Will publish a new -hrt queue with a bunch of fixes
later today.

Comment 5 Chuck Ebbert 2007-09-25 19:26:33 UTC

0.202.rc8 still hangs on boot without idle=poll.

Comment 6 Thomas Gleixner 2007-09-25 19:50:10 UTC

Just sent out a patch to LKML that addresses an AMD X2 problem.

http://lkml.org/lkml/2007/9/25/343

Comment 7 Chuck Ebbert 2007-09-25 19:51:47 UTC

Kernel option "noapictimer" works.
Apparently this is the same as "nolapic_timer" on i386??

Comment 8 Thomas Gleixner 2007-09-25 19:59:29 UTC

yes

Comment 9 Chuck Ebbert 2007-09-26 22:26:39 UTC

Mine still hangs on boot with the c1e patch applied and kernel options:
  nolapic nohz=off highres=off
(pressing power still makes it continue)

Freezes randomly without 'nolapic' (solid lockup.)
Does not hang at boot with:
  nolapic nohz=off highres=off noapictimer

Clock interrupts seem to be sent all over the place, even to IRQ7 (until
disabled as spurious)

           CPU0       CPU1
  0:     392541      11375    XT-PIC-XT        timer
  1:          0        233   IO-APIC-edge      i8042
  2:          0          0    XT-PIC-XT        cascade
  5:          8       5582   IO-APIC-edge      sata_nv
  7:       9403      90597   IO-APIC-edge      ehci_hcd:usb1
  8:          0          1   IO-APIC-edge      rtc
  9:          0        348   IO-APIC-edge      acpi
 10:          1        177   IO-APIC-edge      HDA Intel
 11:         23      38555   IO-APIC-edge      ohci_hcd:usb2, eth0
 12:          0        879   IO-APIC-edge      i8042
 14:          4       1491   IO-APIC-edge      libata
 15:          0          0   IO-APIC-edge      libata
NMI:          0          0
LOC:      11367     392248
ERR:          0

$ cat /proc/cmdline
ro root=LABEL=/ nohz=off highres=off acpi_irq_nobalance nolapic noapictimer
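
A minimal sketch, not part of the original comment, of how the per-CPU counts in a /proc/interrupts dump like the one above could be compared programmatically. The sample rows are copied from the table above; the function name `parse_interrupts` and the parsing rules are assumptions made for illustration, and other kernels may lay out the file differently.

```python
# Sample rows copied from the /proc/interrupts output in this comment.
SAMPLE = """\
           CPU0       CPU1
  0:     392541      11375    XT-PIC-XT        timer
  7:       9403      90597   IO-APIC-edge      ehci_hcd:usb1
LOC:      11367     392248
ERR:          0
"""


def parse_interrupts(text):
    """Return {label: (per_cpu_counts, description)} for each row.

    Hypothetical helper: assumes the first non-empty line names the CPUs
    and every later line has the form "label: count [count ...] [text]".
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())
    table = {}
    for ln in lines[1:]:
        # Split on the first colon only; device names may contain colons.
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


table = parse_interrupts(SAMPLE)
for label in ("0", "7", "LOC"):
    counts, description = table[label]
    print(label, counts, "total:", sum(counts), description)
```

On these sample rows, the two IRQ 7 counts sum to exactly 100000, and the IRQ 0 count on CPU0 is close to the LOC count on CPU1 (and vice versa). To inspect a live system instead, the text from `open("/proc/interrupts").read()` could be passed in place of `SAMPLE`.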

Comment 10 Thomas Gleixner 2007-09-26 22:41:04 UTC

Can you provide a boot log of mainline and the above kernel please ?

Comment 11 Chuck Ebbert 2007-09-26 23:27:22 UTC

(In reply to comment #10)
> Can you provide a boot log of mainline and the above kernel please ?
> 

Building kernel-vanilla now. Any specific options to boot with?

Comment 12 Thomas Gleixner 2007-09-26 23:39:04 UTC

Standard FC8 config is fine. I'll look into it tomorrow.

Comment 13 Chuck Ebbert 2007-09-27 20:47:09 UTC

Created attachment 208991
dmesg from kernel 0.211 with noapictimer option

Kernel 0.211 has the fixed AMD c1e patch applied (disable_apic_timer is
__cpuinitdata, not __initdata). Even with noapictimer, the local apic interrupt
on CPU 0 is incrementing.

Comment 14 Chuck Ebbert 2007-10-03 15:09:13 UTC

*** Bug 237325 has been marked as a duplicate of this bug. ***

Comment 15 Chuck Ebbert 2007-10-03 15:11:54 UTC

When booting on battery, system stalls unless I keep pressing keys on the
keyboard. This does not happen when plugged in. Kernel is 0.214, from F8T3.

Boot options: noapic noirqdebug

Comment 16 Chuck Ebbert 2007-10-03 15:51:32 UTC

*** Bug 316811 has been marked as a duplicate of this bug. ***

Comment 17 Adam Serbinski 2007-10-08 17:20:26 UTC

This problem appears to be solved in 0.222.

Comment 18 Chuck Ebbert 2007-10-10 22:45:54 UTC

Not fixed in kernel 2.6.23-4 here (with new highres-timers patch.) Still using
"noapic noirqdebug" -- kernel hangs and cooling fan spins up until power switch
is pressed, then bootup continues normally. Haven't tried booting on battery yet...

Comment 19 Vaclav "sHINOBI" Misek 2007-10-11 08:18:38 UTC

My notebook works with "noapictimer" with the latest rawhide kernel (0.224) without
problems. Without it, it hangs whether or not it is started on battery. After
pressing the power button the bootup continues, but later, when some services are
started, I need to press some keys (like shift etc., or move the mouse
after gpm is started) to stop it hanging.

Comment 20 Thomas Gleixner 2007-10-13 15:13:56 UTC

Chuck, is nolapictimer working for you as well ? 

Comment #7 https://bugzilla.redhat.com/show_bug.cgi?id=299031#c7 says it worked
for you before. If it works, I can provide some debug patch, which allows us to
decode that problem better.

Comment 21 Thomas Gleixner 2007-10-13 18:46:43 UTC

OK, I stared at the code long enough and found the reason why the AMD C1E
detection works on 32-bit and not on 64-bit. It's not trivial to fix, but I have a
solution in mind already. Patch will follow ASAP.

Comment 22 Thomas Gleixner 2007-10-14 21:06:55 UTC

Created attachment 226751
AMD c1e detection fix

Chuck, does this fix your problem ?

Comment 23 Chuck Ebbert 2007-10-15 17:26:45 UTC

(In reply to comment #22)
> Created an attachment (id=226751)
> AMD c1e detection fix
> 
> Chuck, does this fix your problem ?
> 

It doesn't hang at boot any more, but now the tickless mode seems to be disabled.

Comment 24 Chuck Ebbert 2007-10-15 19:58:01 UTC

Created attachment 227881
nvidia timer override patch

This patch (on top of the new C1E patch) fixes some of my problems with this
nVidia C51/MCP51 chipset x86_64 machine. No messages about c1e are printed anymore
and it hangs on boot even with those last c1e fixes, but the spurious
interrupts are all gone now. Somehow this patch is making the c1e detection
code get skipped, I guess?

Comment 25 Thomas Gleixner 2007-10-15 21:33:48 UTC

I have this patch in my pile of crap already. I'll take a look at it right now.

The tickless mode is disabled due to the C1E detection. Sorry, that's the
fallout from broken hardware. We could get away with permanent broadcasting
though, but this makes only sense, when we have a working HPET in the system.
PIT is so slow to program and for tickless it is just useless due to the small
max. next event delta.

Comment 26 Chuck Ebbert 2007-10-16 17:25:13 UTC

(In reply to comment #25)
> The tickless mode is disabled due to the C1E detection. Sorry, that's the
> fallout from broken hardware. We could get away with permanent broadcasting
> though, but this makes only sense, when we have a working HPET in the system.

Most new systems have one, I know this one does. (3 channels, 32 bit)

Comment 27 Thomas Gleixner 2007-10-16 18:29:48 UTC

Yeah, I know. I hope that Venki will come up with the per-CPU HPET code soon, so we
can really utilize HPET instead of just broadcasting.

Comment 28 Chuck Ebbert 2007-10-16 21:17:57 UTC

I *think* I can see what is happening with the latest c1e detection code. We've
set up local APIC timer interrupts on CPU0 before detecting the problem and
apparently aren't tearing them down properly, causing them to be both broadcast
and fired by the hardware. If I force noapictimer on the command line with the
latest F8 code, everything works fine: CPU1 has 50000 timer interrupts and CPU0
has 50000 local apic interrupts.

So now instead of "noapic noirqdebug" I have to use "noapictimer
acpi_use_timer_override" (the nVidia quirk code from Andi apparently doesn't work)
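
A rough sketch, not from the original thread, of how one might check which of the workaround options discussed in this bug appear on a kernel command line. The names `THREAD_OPTIONS`, `parse_cmdline`, and `workaround_options` are hypothetical, the option list is only what the comments above mention, and the simple whitespace split does not handle quoted arguments the way the kernel's own parser does.

```python
# Options mentioned in this thread; illustrative, not an exhaustive list.
THREAD_OPTIONS = (
    "idle",
    "nohz",
    "highres",
    "nolapic",
    "noapic",
    "noapictimer",
    "noirqdebug",
    "acpi_use_timer_override",
)


def parse_cmdline(cmdline):
    """Split a command line into {key: value}; bare flags map to None."""
    opts = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=LABEL=/" keeps "LABEL=/".
        key, sep, value = token.partition("=")
        opts[key] = value if sep else None
    return opts


def workaround_options(cmdline):
    """Return only the options from THREAD_OPTIONS that are present."""
    opts = parse_cmdline(cmdline)
    return {key: opts[key] for key in THREAD_OPTIONS if key in opts}


# Command line quoted in comment 9 of this bug.
example = "ro root=LABEL=/ nohz=off highres=off acpi_irq_nobalance nolapic noapictimer"
print(workaround_options(example))
```

For a running system, the string could instead be read with `open("/proc/cmdline").read()`.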

Comment 29 Chuck Ebbert 2007-10-17 19:15:58 UTC

Fixed by the -hrt3 patch.
