643661 – kernel: BUG: soft lockup - CPU#3 stuck for 61s! [ksoftirqd/3:13]

Summary: kernel: BUG: soft lockup - CPU#3 stuck for 61s! [ksoftirqd/3:13]

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	14
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-10-16 22:21 UTC by Nathan G. Grennan
Modified:	2012-08-16 18:41 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-08-16 18:41:48 UTC
Type:	---
Embargoed:
Dependent Products:

Description Nathan G. Grennan 2010-10-16 22:21:27 UTC

Description of problem:
  This seems to be an issue in the IRQ code and, judging by the timing, it is triggered by a cron job. My first guess would be updatedb, since it hits the hard drive, and one of the modules attached to IRQ 16 is ahci. I ran 2.6.35.4-25.fc14.x86_64 for a while before this without issue, so I am going to go back to it for now. I may also try 2.6.36 if I can get it working.

  Yes, I know it is tainted by nvidia, but my computer is not useful without it.

Oct 11 04:07:35 proton kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Oct 11 04:07:35 proton kernel: Pid: 0, comm: swapper Tainted: P            2.6.35.5-29.fc14.x86_64 #1
Oct 11 04:07:35 proton kernel: Call Trace:
Oct 11 04:07:35 proton kernel: <IRQ>  [<ffffffff810a64f4>] __report_bad_irq+0x3d/0x8c
Oct 11 04:07:35 proton kernel: [<ffffffff810a665b>] note_interrupt+0x118/0x17d
Oct 11 04:07:35 proton kernel: [<ffffffff810a6e35>] handle_fasteoi_irq+0xa8/0xce
Oct 11 04:07:35 proton kernel: [<ffffffff8100c28f>] handle_irq+0x88/0x91
Oct 11 04:07:35 proton kernel: [<ffffffff8146d3a4>] do_IRQ+0x5c/0xc3
Oct 11 04:07:35 proton kernel: [<ffffffff81467613>] ret_from_intr+0x0/0x11
Oct 11 04:07:35 proton kernel: <EOI>  [<ffffffff8101172d>] ? mwait_idle+0x7a/0x87
Oct 11 04:07:35 proton kernel: [<ffffffff810116df>] ? mwait_idle+0x2c/0x87
Oct 11 04:07:35 proton kernel: [<ffffffff81008c1f>] cpu_idle+0xaa/0xe4
Oct 11 04:07:35 proton kernel: [<ffffffff8145fcf7>] start_secondary+0x253/0x294
Oct 11 04:07:35 proton kernel: handlers:
Oct 11 04:07:35 proton kernel: [<ffffffff81317190>] (ahci_interrupt+0x0/0x5f4)
Oct 11 04:07:35 proton kernel: [<ffffffff8133113c>] (usb_hcd_irq+0x0/0x7b)
Oct 11 04:07:35 proton kernel: [<ffffffffa05ee451>] (nv_kern_isr+0x0/0x5e [nvidia])
Oct 11 04:07:35 proton kernel: Disabling IRQ #16
Oct 11 04:07:42 proton kernel: NVRM: Xid (0001:00): 16, Head 00000000 Count 0004511b
Oct 11 04:07:42 proton kernel: NVRM: Xid (0001:00): 16, Head 00000001 Count 0002add3
Oct 11 04:07:50 proton kernel: NVRM: Xid (0001:00): 8, Channel 0000007f
Oct 11 04:07:56 proton kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4695341271, last ping 4695346272, now 4695351280
Oct 11 04:07:56 proton kernel: connection1:0: detected conn error (1011)
Oct 11 04:07:57 proton iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Oct 11 04:08:35 proton iscsid: connect to 192.168.254.2:3260 failed (No route to host)
Oct 11 04:08:41 proton iscsid: connect to 192.168.254.2:3260 failed (No route to host)
Oct 11 04:08:47 proton iscsid: connect to 192.168.254.2:3260 failed (No route to host)
Oct 11 04:08:50 proton kernel: BUG: soft lockup - CPU#3 stuck for 61s! [ksoftirqd/3:13]
Oct 11 04:08:50 proton kernel: Modules linked in: vmnet ppdev parport_pc parport vmblock vsock vmci vmmon coretemp hwmon_vid fuse capi capifs kernelcapi be2iscsi bnx2i cnic uio cxgb3i iw_cxgb3 cxgb3 mdio ib_iser
Oct 11 08:43:16 proton kernel: imklog 4.4.2, log source = /proc/kmsg started.
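
The "handlers:" block in the report above names every handler registered on the disabled IRQ line. As a rough aid for comparing such reports, here is a minimal Python sketch that extracts those handlers from syslog text in the format shown. The function name `shared_irq_handlers`, the regular expressions, and the `SAMPLE` excerpt are illustrative assumptions, not part of any kernel tooling.

```python
import re

# Matches handler entries such as:
#   [<ffffffff81317190>] (ahci_interrupt+0x0/0x5f4)
#   [<ffffffffa05ee451>] (nv_kern_isr+0x0/0x5e [nvidia])
HANDLER_RE = re.compile(
    r"\[<[0-9a-f]+>\] \((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?\)"
)
IRQ_RE = re.compile(r"irq (\d+): nobody cared")

# Excerpt copied from the log in this report (assumed representative).
SAMPLE = """\
Oct 11 04:07:35 proton kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Oct 11 04:07:35 proton kernel: [<ffffffff810a665b>] note_interrupt+0x118/0x17d
Oct 11 04:07:35 proton kernel: handlers:
Oct 11 04:07:35 proton kernel: [<ffffffff81317190>] (ahci_interrupt+0x0/0x5f4)
Oct 11 04:07:35 proton kernel: [<ffffffff8133113c>] (usb_hcd_irq+0x0/0x7b)
Oct 11 04:07:35 proton kernel: [<ffffffffa05ee451>] (nv_kern_isr+0x0/0x5e [nvidia])
Oct 11 04:07:35 proton kernel: Disabling IRQ #16
"""


def shared_irq_handlers(log_text):
    """Return {irq_number: [(handler_symbol, module_or_None), ...]}."""
    result = {}
    current_irq = None
    in_handlers = False
    for line in log_text.splitlines():
        irq_match = IRQ_RE.search(line)
        if irq_match:
            # A new "nobody cared" report starts here.
            current_irq = int(irq_match.group(1))
            result[current_irq] = []
            in_handlers = False
            continue
        if current_irq is None:
            continue
        if line.rstrip().endswith("handlers:"):
            in_handlers = True
            continue
        if in_handlers:
            handler = HANDLER_RE.search(line)
            if handler:
                result[current_irq].append((handler.group(1), handler.group(2)))
            else:
                # First non-handler line ends the block.
                in_handlers = False
                current_irq = None
    return result


if __name__ == "__main__":
    for irq, handlers in shared_irq_handlers(SAMPLE).items():
        print(irq, handlers)
```

The module field is None for handlers built into the kernel image and carries the bracketed module name otherwise, which makes it easy to see that IRQ 16 here is shared between ahci, the USB host controller, and the nvidia module.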


Version-Release number of selected component (if applicable):
2.6.35.5-29.fc14.x86_64

Additional info:

I had another similar issue with 2.6.35.6-40.fc14.x86_64. This time nothing was logged and the machine only responded to ping. It happened overnight, and the logs ended right around 4am.

Oct 16 04:36:09 proton named[1996]: lame server resolving '34.86.231.24.in-addr.arpa' (in '86.231.24.in-addr.arpa'?): 216.104.96.10#53
Oct 16 14:06:15 proton kernel: imklog 4.4.2, log source = /proc/kmsg started.
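
For anyone collecting these events from several logs, the watchdog line in the first log above has a regular shape: CPU number, stall duration, then the task name and PID in brackets. A minimal Python sketch for pulling those fields out follows. The function name `parse_soft_lockups` and the regular expression are assumptions made for illustration, and only the message format seen in this report is covered.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#3 stuck for 61s! [ksoftirqd/3:13]".
# The task name is matched greedily so that names containing ':' still
# split at the last colon, which precedes the PID.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            events.append(
                {
                    "cpu": int(match.group(1)),
                    "seconds": int(match.group(2)),
                    "task": match.group(3),
                    "pid": int(match.group(4)),
                }
            )
    return events


if __name__ == "__main__":
    line = (
        "Oct 11 04:08:50 proton kernel: BUG: soft lockup - "
        "CPU#3 stuck for 61s! [ksoftirqd/3:13]"
    )
    print(parse_soft_lockups(line))
```

Because the pattern is searched for anywhere in the line, it works with or without a syslog prefix or a dmesg timestamp in front of the message.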

Comment 1 Mark Seger 2011-01-25 18:54:45 UTC

I have a cluster of 120 identical machines, each with 64GB RAM and 64 cores.  9 have reported this error on boot (as opposed to during a cron job).  The error occurred 5 times on each machine, always twice for CPU0, if that helps.  The systems eventually boot and seem to be running OK.

It would be very helpful to know if this is a serious problem OR if once the machine boots it is of less concern.

-mark

Comment 2 john 2011-02-18 13:16:18 UTC

Just thought I'd add a "me too".  HP Proliant DL165 G7 64GB RAM and 24 cores.

Relevant dmesg follows.  Only happens under load and system keeps running afterwards.

-John

[65297.462043] BUG: soft lockup - CPU#2 stuck for 61s! [kswapd0:228]
[65297.462043] Modules linked in: ext2 usb_storage ipv6 igb dca i2c_piix4 amd64_edac_mod edac_core i2c_core k10temp edac_mce_amd serio_raw microcode pata_acpi hpsa ata_generic pata_atiixp cciss megaraid_sas [last unloaded: scsi_wait_scan]
[65297.462043] CPU 2 
[65297.462043] Modules linked in: ext2 usb_storage ipv6 igb dca i2c_piix4 amd64_edac_mod edac_core i2c_core k10temp edac_mce_amd serio_raw microcode pata_acpi hpsa ata_generic pata_atiixp cciss megaraid_sas [last unloaded: scsi_wait_scan]
[65297.462043] 
[65297.462043] Pid: 228, comm: kswapd0 Not tainted 2.6.35.6-45.fc14.x86_64 #1 /ProLiant DL165 G7
[65297.462043] RIP: 0010:[<ffffffff810e5cde>]  [<ffffffff810e5cde>] zone_nr_free_pages+0x6a/0x98
[65297.462043] RSP: 0018:ffff8805291e7d00  EFLAGS: 00000287
[65297.462043] RAX: 000000000000000e RBX: ffff8805291e7d20 RCX: ffff880b51c80000
[65297.462043] RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000100
[65297.462043] RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffffffff81b81f60
[65297.462043] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff8805291e7cf0
[65297.462043] R13: ffffffff8100a68e R14: ffff8805291e7cb0 R15: 0000000000000320
[65297.462043] FS:  00007f3bb39c67e0(0000) GS:ffff880002080000(0000) knlGS:00000000de484b70
[65297.462043] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[65297.462043] CR2: 00000000ee28d000 CR3: 0000000dcc190000 CR4: 00000000000006e0
[65297.462043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[65297.462043] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[65297.462043] Process kswapd0 (pid: 228, threadinfo ffff8805291e6000, task ffff880529275d00)
[65297.462043] Stack:
[65297.462043]  0000000000000000 ffff880100000e00 0000000000000000 0000000000000000
[65297.462043] <0> ffff8805291e7d60 ffffffff810d7049 0000000000000000 ffff880500000000
[65297.462043] <0> ffff880100000000 000000000000000c 0000000000000e00 0000000000000002
[65297.462043] Call Trace:
[65297.462043]  [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
[65297.462043]  [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
[65297.462043]  [<ffffffff8100a68e>] ? apic_timer_interrupt+0xe/0x20
[65297.462043]  [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
[65297.462043]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
[65297.462043]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
[65297.462043]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
[65297.462043]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
[65297.462043]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
[65297.462043]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
[65297.462043] Code: 00 75 4e 4c 8b a7 30 05 00 00 83 c8 ff 4c 8b 2d f1 5b 52 00 eb 18 48 63 c8 48 8b 53 58 48 8b 0c cd 50 04 b8 81 48 0f be 54 0a 42 <49> 01 d4 ff c0 be 00 01 00 00 4c 89 ef 48 63 d0 e8 a9 27 13 00 
[65297.462043] Call Trace:
[65297.462043]  [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
[65297.462043]  [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
[65297.462043]  [<ffffffff8100a68e>] ? apic_timer_interrupt+0xe/0x20
[65297.462043]  [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
[65297.462043]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
[65297.462043]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
[65297.462043]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
[65297.462043]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
[65297.462043]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
[65297.462043]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
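
To compare traces like the one above across reports, it can help to reduce each frame to its symbol, offset, and function size. The following is a minimal Python sketch under the assumption that frames look like `[<address>] symbol+0xOFF/0xSIZE`, optionally followed by a bracketed module name. The name `parse_frames` and the regular expression are illustrative only. Entries the kernel marks with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain, so the sketch flags them as not reliable.

```python
import re

# Matches "[<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba" and the
# symbol part of the RIP line; an optional trailing "[module]" is captured.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\] (\? )?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?"
)


def parse_frames(trace_text):
    """Return a list of frame dicts, in the order they appear in the text."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue
        address, unreliable, symbol, offset, size, module = match.groups()
        frames.append(
            {
                "address": int(address, 16),
                "symbol": symbol,
                "offset": int(offset, 16),  # bytes into the function
                "size": int(size, 16),      # total function size in bytes
                "module": module,           # None for built-in code
                "reliable": unreliable is None,
            }
        )
    return frames


if __name__ == "__main__":
    sample = (
        "[65297.462043] RIP: 0010:[<ffffffff810e5cde>]  "
        "[<ffffffff810e5cde>] zone_nr_free_pages+0x6a/0x98\n"
        "[65297.462043]  [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba\n"
    )
    for frame in parse_frames(sample):
        print(frame["symbol"], hex(frame["offset"]), frame["reliable"])
```

Lines without a `symbol+0xOFF/0xSIZE` entry, such as the register dump, the raw stack words, and the "Code:" bytes, are skipped.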

Comment 3 Andy Lawrence 2011-04-02 13:04:59 UTC

I am also seeing this in F15. System does not recover, VT switch fails, must reboot.


Apr  2 08:45:30 ace kernel: [ 5950.898533] BUG: soft lockup - CPU#3 stuck for 67s! [kswapd0:49]
Apr  2 08:45:30 ace kernel: [ 5950.898535] Modules linked in: fuse coretemp sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf rfcomm sco bnep l2cap ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel arc4 snd_hda_codec snd_hwdep snd_seq iwlagn iwlcore snd_seq_device mac80211 snd_pcm btusb snd_timer uvcvideo microcode snd e1000e cfg80211 bluetooth iTCO_wdt i2c_i801 joydev soundcore videodev iTCO_vendor_support snd_page_alloc v4l2_compat_ioctl32 rfkill wmi uinput ipv6 firewire_ohci sdhci_pci sdhci firewire_core mmc_core crc_itu_t i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Apr  2 08:45:30 ace kernel: [ 5950.898565] CPU 3 
Apr  2 08:45:30 ace kernel: [ 5950.898573] Modules linked in: fuse coretemp sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf rfcomm sco bnep l2cap ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel arc4 snd_hda_codec snd_hwdep snd_seq iwlagn iwlcore snd_seq_device mac80211 snd_pcm btusb snd_timer uvcvideo microcode snd e1000e cfg80211 bluetooth iTCO_wdt i2c_i801 joydev soundcore videodev iTCO_vendor_support snd_page_alloc v4l2_compat_ioctl32 rfkill wmi uinput ipv6 firewire_ohci sdhci_pci sdhci firewire_core mmc_core crc_itu_t i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Apr  2 08:45:30 ace kernel: [ 5950.898594] 
Apr  2 08:45:30 ace kernel: [ 5950.898596] Pid: 49, comm: kswapd0 Not tainted 2.6.38.2-10.fc15.x86_64 #1 LENOVO 4239CTO/4239CTO
Apr  2 08:45:30 ace kernel: [ 5950.898599] RIP: 0010:[<ffffffffa007d097>]  [<ffffffffa007d097>] i915_gem_inactive_shrink+0x6c/0x194 [i915]
Apr  2 08:45:30 ace kernel: [ 5950.898611] RSP: 0018:ffff88006ca6fd50  EFLAGS: 00000206
Apr  2 08:45:30 ace kernel: [ 5950.898612] RAX: ffff880041d1c200 RBX: 00000000000000c0 RCX: 0000000000000000
Apr  2 08:45:30 ace kernel: [ 5950.898613] RDX: ffff8800235a44b0 RSI: 0000000000000000 RDI: ffff880037a91820
Apr  2 08:45:30 ace kernel: [ 5950.898615] RBP: ffff88006ca6fd90 R08: 0000000000000004 R09: 0000000000000009
Apr  2 08:45:30 ace kernel: [ 5950.898616] R10: 0000000000000002 R11: ffffffff81a44e40 R12: ffffffff8100a58e
Apr  2 08:45:30 ace kernel: [ 5950.898617] R13: ffff88006ca6fcf0 R14: ffff88006ca6fcf8 R15: ffffffff810dfda7
Apr  2 08:45:30 ace kernel: [ 5950.898619] FS:  0000000000000000(0000) GS:ffff8800786c0000(0000) knlGS:0000000000000000
Apr  2 08:45:30 ace kernel: [ 5950.898621] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr  2 08:45:30 ace kernel: [ 5950.898622] CR2: 00000036f26ac524 CR3: 000000005d3a3000 CR4: 00000000000406e0
Apr  2 08:45:30 ace kernel: [ 5950.898623] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr  2 08:45:30 ace kernel: [ 5950.898625] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr  2 08:45:30 ace kernel: [ 5950.898626] Process kswapd0 (pid: 49, threadinfo ffff88006ca6e000, task ffff88006ca64560)
Apr  2 08:45:30 ace kernel: [ 5950.898628] Stack:
Apr  2 08:45:30 ace kernel: [ 5950.898629]  ffff88006ca6fd90 ffff88003790b5c8 ffff88006ca6fd60 ffff88003790b580
Apr  2 08:45:30 ace kernel: [ 5950.898631]  0000000000000000 0000000000000000 00000000000000d0 0000000000035cdd
Apr  2 08:45:30 ace kernel: [ 5950.898634]  ffff88006ca6fde0 ffffffff810e44ed 0000000000000001 0000000000000080
Apr  2 08:45:30 ace kernel: [ 5950.898636] Call Trace:
Apr  2 08:45:30 ace kernel: [ 5950.898640]  [<ffffffff810e44ed>] shrink_slab+0x6d/0x166
Apr  2 08:45:30 ace kernel: [ 5950.898643]  [<ffffffff810e7116>] kswapd+0x517/0x77c
Apr  2 08:45:30 ace kernel: [ 5950.898645]  [<ffffffff810e6bff>] ? kswapd+0x0/0x77c
Apr  2 08:45:30 ace kernel: [ 5950.898647]  [<ffffffff8106ea73>] kthread+0x84/0x8c
Apr  2 08:45:30 ace kernel: [ 5950.898650]  [<ffffffff8100a9e4>] kernel_thread_helper+0x4/0x10
Apr  2 08:45:30 ace kernel: [ 5950.898651]  [<ffffffff8106e9ef>] ? kthread+0x0/0x8c
Apr  2 08:45:30 ace kernel: [ 5950.898653]  [<ffffffff8100a9e0>] ? kernel_thread_helper+0x0/0x10
Apr  2 08:45:30 ace kernel: [ 5950.898654] Code: e4 48 89 45 c8 75 37 48 8b 43 48 45 31 ed 48 83 c3 48 48 2d b0 00 00 00 eb 0a 48 8d 82 50 ff ff ff 41 ff c5 48 8b 90 b0 00 00 00 
Apr  2 08:45:30 ace kernel: [ 5950.898673] Call Trace:
Apr  2 08:45:30 ace kernel: [ 5950.898675]  [<ffffffff810e44ed>] shrink_slab+0x6d/0x166
Apr  2 08:45:30 ace kernel: [ 5950.898676]  [<ffffffff810e7116>] kswapd+0x517/0x77c
Apr  2 08:45:30 ace kernel: [ 5950.898678]  [<ffffffff810e6bff>] ? kswapd+0x0/0x77c
Apr  2 08:45:30 ace kernel: [ 5950.898680]  [<ffffffff8106ea73>] kthread+0x84/0x8c
Apr  2 08:45:30 ace kernel: [ 5950.898682]  [<ffffffff8100a9e4>] kernel_thread_helper+0x4/0x10
Apr  2 08:45:30 ace kernel: [ 5950.898683]  [<ffffffff8106e9ef>] ? kthread+0x0/0x8c
Apr  2 08:45:30 ace kernel: [ 5950.898685]  [<ffffffff8100a9e0>] ? kernel_thread_helper+0x0/0x10
Apr  2 08:45:31 ace abrt-dump-oops: Found oopses: 1

Comment 4 Fedora End Of Life 2012-08-16 18:41:50 UTC

This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advance warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
