Bug 455621

Summary: 100% CPU Use in Hardware Interrupts with noapic option after updating to new kernel
Product: [Fedora] Fedora
Reporter: Alex Chernyakhovsky <achernya>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 9
CC: macro
Hardware: x86_64   
OS: Linux   
Last Closed: 2008-09-10 01:20:10 UTC

Description Alex Chernyakhovsky 2008-07-16 17:24:49 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9) Gecko/2008061712 Fedora/3.0-1.fc9 Firefox/3.0

Description of problem:
I am running an HP tx2000z series laptop (a tx2117cl to be exact) with a dual-core AMD Turion 64 X2 TL-62 (2.1 GHz cores).

I had Fedora 9 with kernel-2.6.25.9-76.fc9 running perfectly, including suspend-to-ram and hibernate, as long as I booted with the noapic option. According to top, there were no negative side effects to this, and the machine worked perfectly, aside from a few minor glitches not relevant to this bug.

However, after I updated to kernel-2.6.25.10-86.fc9.x86_64, the second CPU is always 100% in use servicing hardware interrupts (hi).

Since this did not occur earlier, I can only presume that it is a bug. I cannot boot consistently without noapic; doing so gives rise to a Machine Check Exception. However, since this machine works perfectly with noapic and in Windows Vista, I strongly doubt that there is a hardware issue.

Version-Release number of selected component (if applicable):
kernel-2.6.25.10-86.fc9.x86_64

How reproducible:
Always


Steps to Reproduce:
1. Install latest kernel update (kernel-2.6.25.10-86.fc9.x86_64) on a tx2000z series tablet
2. Attempt to boot with noapic
3. Observe 100% CPU use on one core servicing hardware interrupts

Actual Results:
100% CPU use in hardware interrupts.

Expected Results:
Both cores 99% idle, with the remaining CPU time used by processes rather than by hardware interrupts.

Additional info:
This is a tablet PC, manufactured by HP. As with all other HP laptops, it requires noapic to boot properly. This is the first time I have seen this CPU utilization on this tablet when booting that way.
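
The "hi" figure that top reports can also be derived from /proc/stat, which may be easier to capture for a report like this one. The following is a minimal sketch and not part of the original report: the function name irq_share is made up for illustration, and it assumes the usual /proc/stat column order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# irq_share BEFORE AFTER   (hypothetical helper name)
# Each argument is one "cpuN ..." line copied from /proc/stat.
# Prints the integer percentage of elapsed ticks spent in hardware
# interrupt context between the two samples.
irq_share() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal (fields 2-9). The guest counter is skipped
            # because it is already included in user time.
            n = (NF < 9) ? NF : 9
            total = 0
            for (i = 2; i <= n; i++) total += $i
            irq[NR] = $7        # sixth counter = hardware irq ticks
            sum[NR] = total
        }
        END {
            dt = sum[2] - sum[1]
            if (dt <= 0) { print 0; exit }
            printf "%d\n", (irq[2] - irq[1]) * 100 / dt
        }'
}

# Possible usage on a live system (cpu1 is the second core):
#   a=$(grep '^cpu1 ' /proc/stat); sleep 5; b=$(grep '^cpu1 ' /proc/stat)
#   irq_share "$a" "$b"
```

A value near 100 for one core over several seconds would match the behaviour described above; a value near 0 would match the expected result.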

Comment 1 Chuck Ebbert 2008-07-20 19:52:28 UTC
I have a tx1000 that also needs noapic to boot. It occasionally gets into that
state where a CPU gets stuck processing interrupts but usually it stops after a
while.

F9 kernels starting with 2.6.25.11-95 have the sysrq-l key added to show a
backtrace on all processors. It can be activated by the command

  echo 'l' >/proc/sysrq-trigger

Additional debugging will hopefully be added to the next release so we can find
out why noapic is needed too.


Comment 2 Alex Chernyakhovsky 2008-07-20 20:08:20 UTC
I have checked my tx2000z laptop's F9 installation, and I only have
kernel-2.6.25.9-76.fc9.x86_64 and kernel-2.6.25.10-86.fc9.x86_64 installed. Is
2.6.25.11-95 still in rawhide?

Additionally, I see the following message shortly after boot-up or reloading the
USB modules:

Jul 20 15:57:38 localhost kernel: irq 7: nobody cared (try booting with the "irqpoll" option)
Jul 20 15:57:38 localhost kernel: Pid: 0, comm: swapper Tainted: P         2.6.25.9-76.fc9.x86_64 #1
Jul 20 15:57:38 localhost kernel: 
Jul 20 15:57:38 localhost kernel: Call Trace:
Jul 20 15:57:38 localhost kernel:  <IRQ>  [<ffffffff8107180f>] __report_bad_irq+0x38/0x7c
Jul 20 15:57:38 localhost kernel:  [<ffffffff81071a35>] note_interrupt+0x1e2/0x249
Jul 20 15:57:38 localhost kernel:  [<ffffffff8107221b>] handle_level_irq+0xb1/0xe7
Jul 20 15:57:38 localhost kernel:  [<ffffffff8100e48f>] do_IRQ+0xf7/0x167
Jul 20 15:57:38 localhost kernel:  [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
Jul 20 15:57:38 localhost kernel:  [<ffffffff8100c3f1>] ret_from_intr+0x0/0xa
Jul 20 15:57:38 localhost kernel:  <EOI>  [<ffffffff8100b029>] ? default_idle+0x39/0x5f
Jul 20 15:57:38 localhost kernel:  [<ffffffff8100b024>] ? default_idle+0x34/0x5f
Jul 20 15:57:38 localhost kernel:  [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
Jul 20 15:57:38 localhost kernel:  [<ffffffff8100afa8>] ? cpu_idle+0x78/0xc0
Jul 20 15:57:38 localhost kernel:  [<ffffffff81289e7b>] ? start_secondary+0x3fc/0x40b
Jul 20 15:57:38 localhost kernel: 
Jul 20 15:57:38 localhost kernel: handlers:
Jul 20 15:57:38 localhost kernel: [<ffffffff811be414>] (usb_hcd_irq+0x0/0x63)
Jul 20 15:57:38 localhost kernel: Disabling IRQ #7

Perhaps it is related? Following the recommendation to boot with irqpoll (in
addition to my normal noapic) results in a deadlocked machine. 

However, USB works, as long as no new devices are plugged in.
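
The "handlers:" section of the trace names usb_hcd_irq as the handler registered on IRQ 7. On a running system, /proc/interrupts gives a similar view, with per-CPU counts and the drivers attached to each line. The sketch below is only an illustration: irq_line is a made-up helper name, and it assumes the usual /proc/interrupts layout in which each row starts with the IRQ number followed by a colon.

```shell
#!/bin/sh
# irq_line N   (hypothetical helper name)
# Reads /proc/interrupts-formatted text on stdin and prints the row for
# IRQ N with leading whitespace removed. Prints nothing if N is absent.
irq_line() {
    awk -v irq="$1" '$1 == (irq ":") { sub(/^[ \t]+/, ""); print }'
}

# Possible usage on a live system:
#   irq_line 7 < /proc/interrupts
# Running it twice a few seconds apart shows how fast the count grows.
```

If the count on that row climbs rapidly while no USB device is in use, that would be consistent with the spurious interrupts suspected here.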

Comment 3 Chuck Ebbert 2008-07-21 02:08:40 UTC
(In reply to comment #2)
> I have checked my tx2000z laptop's F9 installation, and I only have
> kernel-2.6.25.9-76.fc9.x86_64 and kernel-2.6.25.10-86.fc9.x86_64 installed. Is
> 2.6.25.11-95 still in rawhide?
>
It has not been built yet.

> Additionally, I see the following message shortly after boot-up or reloading the
> USB modules:
> 
> Jul 20 15:57:38 localhost kernel: irq 7: nobody cared (try booting with the "irqpoll" option)
> Jul 20 15:57:38 localhost kernel: Pid: 0, comm: swapper Tainted: P         2.6.25.9-76.fc9.x86_64 #1
> Jul 20 15:57:38 localhost kernel: 
> Jul 20 15:57:38 localhost kernel: Call Trace:
> Jul 20 15:57:38 localhost kernel:  <IRQ>  [<ffffffff8107180f>] __report_bad_irq+0x38/0x7c
> Jul 20 15:57:38 localhost kernel:  [<ffffffff81071a35>] note_interrupt+0x1e2/0x249
> Jul 20 15:57:38 localhost kernel:  [<ffffffff8107221b>] handle_level_irq+0xb1/0xe7
> Jul 20 15:57:38 localhost kernel:  [<ffffffff8100e48f>] do_IRQ+0xf7/0x167
> Jul 20 15:57:38 localhost kernel:  [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
> Jul 20 15:57:38 localhost kernel:  [<ffffffff8100c3f1>] ret_from_intr+0x0/0xa
> Jul 20 15:57:38 localhost kernel:  <EOI>  [<ffffffff8100b029>] ? default_idle+0x39/0x5f
> Jul 20 15:57:38 localhost kernel:  [<ffffffff8100b024>] ? default_idle+0x34/0x5f
> Jul 20 15:57:38 localhost kernel:  [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
> Jul 20 15:57:38 localhost kernel:  [<ffffffff8100afa8>] ? cpu_idle+0x78/0xc0
> Jul 20 15:57:38 localhost kernel:  [<ffffffff81289e7b>] ? start_secondary+0x3fc/0x40b
> Jul 20 15:57:38 localhost kernel: 
> Jul 20 15:57:38 localhost kernel: handlers:
> Jul 20 15:57:38 localhost kernel: [<ffffffff811be414>] (usb_hcd_irq+0x0/0x63)
> Jul 20 15:57:38 localhost kernel: Disabling IRQ #7
> 
> Perhaps it is related? Following the recommendation to boot with irqpoll (in
> addition to my normal noapic) results in a deadlocked machine. 
> 
> However, USB works, as long as no new devices are plugged in.

Use 'noirqdebug' and the bogus interrupts will be ignored instead of causing errors.


Comment 4 Chuck Ebbert 2008-07-22 03:34:34 UTC
2.6.25.11-97 has been submitted to the updates-testing repository.


Comment 5 Alex Chernyakhovsky 2008-07-30 17:34:04 UTC
I have installed 2.6.25.11-97 and run the specified command. I see the following
in /var/log/messages:

Jul 30 13:33:10 localhost kernel: SysRq : Show backtrace of all active CPUs
Jul 30 13:33:10 localhost kernel: CPU1:
Jul 30 13:33:10 localhost kernel:  ffff8100bb693f18 0000000000000046 ffffffff81195b3e 0000000000000000
Jul 30 13:33:10 localhost kernel:  0000000000000001 00007f464090b760 ffff8100bb693f58 ffffffff8100d817
Jul 30 13:33:10 localhost kernel:  ffff8100bb693f78 ffffffff81195b86 ffff8100bb693f78 0000000000000000
Jul 30 13:33:10 localhost kernel: Call Trace:
Jul 30 13:33:10 localhost kernel:  <IRQ>  [<ffffffff81195b3e>] ? showacpu+0x0/0x5b
Jul 30 13:33:10 localhost kernel:  [<ffffffff8100d817>] ? show_stack+0x10/0x12
Jul 30 13:33:10 localhost kernel:  [<ffffffff81195b86>] ? showacpu+0x48/0x5b
Jul 30 13:33:10 localhost kernel:  [<ffffffff8101b0ea>] ? smp_call_function_interrupt+0x48/0x71
Jul 30 13:33:10 localhost kernel:  [<ffffffff8100ca36>] ? call_function_interrupt+0x66/0x70
Jul 30 13:33:10 localhost kernel:  <EOI> 

This is a boot with noapic.

Comment 6 Alex Chernyakhovsky 2008-07-30 17:40:52 UTC
Backtrace without noapic (lucky boot? unsure at this time)
Jul 30 13:39:57 localhost kernel: SysRq : Show backtrace of all active CPUs
Jul 30 13:39:57 localhost kernel: CPU0:
Jul 30 13:39:57 localhost kernel:  ffffffff814bbf18 0000000000000046 ffffffff81195b3e 0000000000000000
Jul 30 13:39:57 localhost kernel:  0000000000000020 0000000000000001 ffffffff814bbf58 ffffffff8100d817
Jul 30 13:39:57 localhost kernel:  ffffffff814bbf78 ffffffff81195b86 ffffffff814bbf88 0000000000000000
Jul 30 13:39:57 localhost kernel: Call Trace:
Jul 30 13:39:57 localhost kernel:  <IRQ>  [<ffffffff81195b3e>] ? showacpu+0x0/0x5b
Jul 30 13:39:57 localhost kernel:  [<ffffffff8100d817>] ? show_stack+0x10/0x12
Jul 30 13:39:57 localhost kernel:  [<ffffffff81195b86>] ? showacpu+0x48/0x5b
Jul 30 13:39:57 localhost kernel:  [<ffffffff8101b0ea>] ? smp_call_function_interrupt+0x48/0x71
Jul 30 13:39:57 localhost kernel:  [<ffffffff8100ca36>] ? call_function_interrupt+0x66/0x70
Jul 30 13:39:57 localhost kernel:  <EOI>  [<ffffffff810cc63c>] ? inotify_inode_queue_event+0x34/0xd9
Jul 30 13:39:57 localhost kernel:  [<ffffffff8110149b>] ? security_file_permission+0x11/0x13
Jul 30 13:39:57 localhost kernel:  [<ffffffff810a43d6>] ? do_readv_writev+0x17e/0x193
Jul 30 13:39:57 localhost kernel:  [<ffffffff8106c6c3>] ? audit_syscall_entry+0x126/0x15a
Jul 30 13:39:57 localhost kernel:  [<ffffffff8106c394>] ? audit_syscall_exit+0x331/0x353
Jul 30 13:39:57 localhost kernel:  [<ffffffff810a4429>] ? vfs_writev+0x3e/0x49
Jul 30 13:39:57 localhost kernel:  [<ffffffff810a447b>] ? sys_writev+0x47/0x94
Jul 30 13:39:57 localhost kernel:  [<ffffffff8100c052>] ? tracesys+0xd5/0xda
Jul 30 13:39:57 localhost kernel: 
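
When comparing backtraces like the two above, it can help to reduce each one to its list of function names. The following is a rough sketch under stated assumptions: trace_symbols is a made-up helper name, it relies on the "symbol+0xOFFSET/0xSIZE" notation seen in these logs, and it uses grep -o, which GNU and BSD grep provide but POSIX does not require.

```shell
#!/bin/sh
# trace_symbols   (hypothetical helper name)
# Reads kernel log text on stdin and prints one function name per
# "symbol+0xOFFSET/0xSIZE" entry, in the order the entries appear.
trace_symbols() {
    grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' |
        sed 's/+0x.*//'
}

# Possible usage, assuming the log location mentioned in this thread:
#   grep 'kernel:' /var/log/messages | trace_symbols
```

Applied to the noapic backtrace, this yields only the sysrq reporting path (showacpu, show_stack, and the call-function interrupt entries), so that capture does not by itself show which handler was consuming the CPU.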


Comment 7 Chuck Ebbert 2008-08-01 20:23:57 UTC
The computer should work fine without noapic with kernels 2.6.25 and later. It
was only older kernels that needed that option. Check the boot messages and look
for a line containing " using 0xed I/O delay port". If that is present IOAPIC
mode should just work.


Comment 8 Alex Chernyakhovsky 2008-08-01 20:43:04 UTC
Unfortunately, I cannot get the tx2000z to reliably boot without noapic in any
kernel. I just tried, and received a Machine Check Exception (the whole reason I
had to use noapic in the first place):

HARDWARE ERROR
CPU 0: Machine Check Exception 4 Bank 4: b2000000000f0f
TSC 11e4deb697

The kernel claims that this is a hardware issue, but since Windows Vista can
survive, as can a noapic'd kernel, I strongly doubt the kernel's claim.

Furthermore, when I _do_ get the tx2000z to boot without noapic, it can't wake
up from suspend-to-ram. When I run 2.6.25.9-76 with noapic, everything works
properly (with the exception of an irq 7: nobody cared).

Comment 9 Alex Chernyakhovsky 2008-08-02 01:47:38 UTC
After some playing with options, including re-installing kmod-nvidia from livna,
I found that 2.6.25.11-97 performs the same way as the earlier kernels, with the
exception that the second core is stuck at 100% hardware interrupts until the
"irq 7: nobody cared" message occurs. (This is with the noapic option.)

Comment 10 Chuck Ebbert 2008-08-12 02:17:55 UTC
(In reply to comment #8)
> Unfortunately, I cannot get the tx2000z to reliably boot without noapic in any
> kernel. I just tried, and received a Machine Check Exception (the whole reason I
> had to use noapic in the first place):
> 
> HARDWARE ERROR
> CPU 0: Machine Check Exception 4 Bank 4: b2000000000f0f
> TSC 11e4deb697
> 
> The kernel claims that this is a hardware issue, but since Windows Vista can
> survive, as can a noapic'd kernel, I strongly doubt the kernel's claim.
> 
> Furthermore, when I _do_ get the tx2000z to boot without noapic, it can't wake
> up from suspend-to-ram. When I run 2.6.25.9-76 with noapic, everything works
> properly (with the exception of an irq 7: nobody cared).

Does it print the line containing " using 0xed I/O delay port" ?

If not, try booting with the option "io_delay=0xed"
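
Both checks suggested here can be scripted. This is a minimal sketch, not something posted in the thread: the helper names uses_0xed_port and has_boot_option are made up, and the only strings taken from the discussion are the boot message fragment and the io_delay=0xed option.

```shell
#!/bin/sh
# uses_0xed_port   (hypothetical helper name)
# Reads kernel boot messages on stdin. Succeeds if the kernel reported
# that it selected port 0xed for I/O delays.
uses_0xed_port() {
    grep -q 'using 0xed I/O delay port'
}

# has_boot_option OPTION CMDLINE   (hypothetical helper name)
# Succeeds if OPTION appears as a whole word in the kernel command line.
has_boot_option() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Possible usage on a live system:
#   dmesg | uses_0xed_port || echo "0xed delay port was not auto-selected"
#   has_boot_option io_delay=0xed "$(cat /proc/cmdline)" && echo "option set"
```

If the first check fails and the second succeeds, the machine is relying on the manual option rather than on automatic detection.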

Comment 11 Alex Chernyakhovsky 2008-08-12 17:53:38 UTC
I have checked the boot logs. There is no mention of 0xed at all. I continue to get an MCE if I boot without any options.

However, adding io_delay=0xed seems to fix that. There is still nothing printed, but there is no MCE, and I am able to suspend to RAM. It has survived a few boot-ups and seems to be functioning correctly.

Comment 12 Alex Chernyakhovsky 2008-08-12 17:58:53 UTC
Also, I found out that the 100% CPU utilization (under noapic) seems to be caused by the USB bus. Once the "irq 7: nobody cared" message occurs, the (spurious?) USB interrupts stop and the CPU utilization returns to idle.

Furthermore, io_delay=0xed has cleared up the "irq 7: nobody cared" message and reduced the total wakeups (as shown by powertop). It is now between 500 and 600 wakeups per 10 seconds, depending on use, while earlier it was at least 1500 wakeups per 10 seconds. This is a massive improvement.
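
To put a number on that improvement, the wakeup counts can be compared directly. This is a small arithmetic sketch using the figures quoted above; pct_drop is a made-up helper name, and 1500 is used as the earlier figure even though it was reported as a lower bound.

```shell
#!/bin/sh
# pct_drop BEFORE AFTER   (hypothetical helper name)
# Prints the integer percentage decrease from BEFORE to AFTER,
# truncated toward zero.
pct_drop() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

pct_drop 1500 600   # prints 60
pct_drop 1500 500   # prints 66
```

So the reported change corresponds to a reduction of at least roughly 60 to 66 percent in wakeups.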