649694 – BUG: soft lockup - CPU#2 stuck for 61s! [kswapd0:128]

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	14
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	661911

Reported:	2010-11-04 10:22 UTC by Catalin BOIE
Modified:	2011-12-22 05:05 UTC
CC List:	13 users
Fixed In Version:
Clone Of:
Clones:	661911
Environment:
Last Closed:	2010-12-02 18:34:48 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg output for the "CPU stuck" error. (39.60 KB, text/plain) 2010-11-09 04:36 UTC, Luke Hutchison

Description Catalin BOIE 2010-11-04 10:22:19 UTC

Description of problem:
I get BUG messages in the logs, and ssh connections hang for a long time.

Version-Release number of selected component (if applicable):
2.6.35.6-48.fc14.x86_64

How reproducible:
I have no idea how to reproduce.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
The CPU gets stuck somewhere in the kernel.

Expected results:
No lockups.

Additional info:
free:
             total       used       free     shared    buffers     cached
Mem:      16522288   16411324     110964          0    *5111400*      *48728*
-/+ buffers/cache:   11251196    5271092
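
The "-/+ buffers/cache" row above is derived from the "Mem:" row: buffers and cached are subtracted from "used" and added to "free". A minimal Python sketch of that arithmetic, using the numbers from this report (the function name is illustrative and not part of procps):

```python
def buffers_cache_row(used, free, buffers, cached):
    """Return (used, free) as shown on the '-/+ buffers/cache' row of `free`."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# Values in KiB, copied from the "Mem:" row of the report.
used_adj, free_adj = buffers_cache_row(
    used=16411324, free=110964, buffers=5111400, cached=48728
)
print(used_adj, free_adj)  # 11251196 5271092
```

With these values roughly 4.9 GiB is held in buffers while only about 48 MiB is in the page cache, which matches the two figures the reporter highlighted.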

BUG: soft lockup - CPU#2 stuck for 61s! [kswapd0:128]
Modules linked in: cpufreq_stats freq_table ipt_MASQUERADE iptable_nat nf_nat sit tunnel4 tun ebtable_nat ebtables bridge stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 vfat fat kvm_intel kvm bnx2 i7core_edac ioatdma edac_core cdc_ether matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc usbnet shpchp i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support dca joydev serio_raw mii usb_storage megaraid
CPU 2
Modules linked in: cpufreq_stats freq_table ipt_MASQUERADE iptable_nat nf_nat sit tunnel4 tun ebtable_nat ebtables bridge stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 vfat fat kvm_intel kvm bnx2 i7core_edac ioatdma edac_core cdc_ether matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc usbnet shpchp i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support dca joydev serio_raw mii usb_storage megaraid

Pid: 128, comm: kswapd0 Not tainted 2.6.35.6-48.fc14.x86_64 #1 69Y4438     /System x3650 M3 -[7945K3G]-
RIP: 0010:[<ffffffff810e5c8d>]  [<ffffffff810e5c8d>] zone_nr_free_pages+0x19/0x98
RSP: 0018:ffff8802751e1d00  EFLAGS: 00000286
RAX: 0000000000000000 RBX: ffff8802751e1d20 RCX: 0000000000000000
RDX: 0000000000000f8b RSI: 0000000000000000 RDI: ffff880100000000
RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffffffff81b81f80
R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff8802751e1cf0
R13: ffffffff8100a68e R14: ffff8802751e1cb0 R15: 0000000000000060
FS:  0000000000000000(0000) GS:ffff880002040000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa6d6414c40 CR3: 0000000001a42000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 128, threadinfo ffff8802751e0000, task ffff880276972e80)
Stack:
 0000000000000000 ffff880100000000 0000000000000000 0000000000000000
<0> ffff8802751e1d60 ffffffff810d7049 0000000000000000 ffff880200000000
<0> ffff880100000000 000000000000000c 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
 [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
 [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
 [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
 [<ffffffff81065f29>] ? kthread+0x7f/0x87
 [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81065eaa>] ? kthread+0x0/0x87
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Code: 40 0f a3 3a 19 ff 85 ff 74 e3 48 8b 10 48 89 11 c9 c3 55 48 89 e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 48 8b 97 30 05 00 00 <31> c0 48 89 fb 48 85 d2 48 0f 49 c2 48 3b 47 18 73 65 48 8b 97
Call Trace:
 [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
 [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
 [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
 [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
 [<ffffffff81065f29>] ? kthread+0x7f/0x87
 [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81065eaa>] ? kthread+0x0/0x87
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10

Comment 1 Mark F 2010-11-05 00:17:05 UTC

I'm having the same issue.  2.6.35.6-45.fc14.x86_64

I can reproduce consistently by importing a large mysql dump.

# free -k
             total       used       free     shared    buffers     cached
Mem:       4114904    4090508      24396          0      57732    2577020
-/+ buffers/cache:    1455756    2659148
Swap:            0          0          0

MySQL's InnoDB buffer pool is sized at 1G.  During the import, mysqld uses about 30% of memory.  I have swap disabled here, but the same happens with 4G of swap.  As the import progresses, kswapd takes progressively more CPU time, and eventually these start showing up:

[ 2536.178964] BUG: soft lockup - CPU#4 stuck for 61s! [kswapd1:73]
[ 2536.179329] Modules linked in: ipv6 i7core_edac edac_core iTCO_wdt iTCO_vendor_support tg3 serio_raw hed raid0 raid1 [last unloaded: scsi_wait_scan]
[ 2536.179339] CPU 4
[ 2536.179340] Modules linked in: ipv6 i7core_edac edac_core iTCO_wdt iTCO_vendor_support tg3 serio_raw hed raid0 raid1 [last unloaded: scsi_wait_scan]
[ 2536.179348]
[ 2536.179352] Pid: 73, comm: kswapd1 Not tainted 2.6.35.6-45.fc14.x86_64 #1 /ProLiant ML150 G6
[ 2536.179354] RIP: 0010:[<ffffffff8121852f>]  [<ffffffff8121852f>] find_next_bit+0x93/0x9c
[ 2536.179366] RSP: 0018:ffff88007bd6fcf0  EFLAGS: 00000206
[ 2536.179368] RAX: 00000000000000fc RBX: ffff88007bd6fcf0 RCX: 0000000000000002
[ 2536.179371] RDX: 0000000000000002 RSI: 0000000000000100 RDI: 0000000000000100
[ 2536.179373] RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffffffff81b81f60
[ 2536.179375] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff88007bd6fcb0
[ 2536.179377] R13: 00000000000000a0 R14: ffff88007bd6fc70 R15: ffff88007bd6fc78
[ 2536.179380] FS:  0000000000000000(0000) GS:ffff880102000000(0000) knlGS:0000000000000000
[ 2536.179383] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2536.179385] CR2: 00007f4682116026 CR3: 0000000008140000 CR4: 00000000000006e0
[ 2536.179388] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2536.179390] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2536.179393] Process kswapd1 (pid: 73, threadinfo ffff88007bd6e000, task ffff88007bd71740)
[ 2536.179395] Stack:
[ 2536.179730]  ffff88007bd6fd20 ffffffff810e5cf3 0000000000000000 ffff880100000e00
[ 2536.179734] <0> 0000000000000000 0000000000000000 ffff88007bd6fd60 ffffffff810d7049
[ 2536.179737] <0> ffffffffffffff10 ffffffff00000000 ffff880100000000 000000000000000c
[ 2536.179741] Call Trace:
[ 2536.180148]  [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98
[ 2536.180156]  [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
[ 2536.180160]  [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
[ 2536.180164]  [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
[ 2536.180170]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
[ 2536.180173]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
[ 2536.180176]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
[ 2536.180182]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
[ 2536.180185]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
[ 2536.180188]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
[ 2536.180189] Code: c7 c0 ff ff ff 75 e3 48 85 ff 4c 89 c0 74 23 49 8b 01 b9 40 00 00 00 48 83 ca ff 29 f9 48 d3 ea 48 21 d0 75 06 49 8d 04 38 eb 07 <48> 0f bc c0 4c 01 c0 c9 c3 55 48 39 f2 48 89 f0 48 89 e5 0f 83
[ 2536.181502] Call Trace:
[ 2536.181505]  [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98
[ 2536.181508]  [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
[ 2536.181512]  [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
[ 2536.181515]  [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
[ 2536.181518]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
[ 2536.181522]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
[ 2536.181524]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
[ 2536.181527]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
[ 2536.181530]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
[ 2536.181533]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
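
One way to watch kswapd's CPU time grow during a reproduction like the one described above is to read the utime and stime fields from /proc/<pid>/stat. The following is a sketch only: the function names are made up for illustration, and it assumes a Linux /proc layout where utime and stime are fields 14 and 15 of the stat line.

```python
import os


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split only what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def kswapd_ticks(proc="/proc"):
    """Map each kswapd thread name (kswapd0, kswapd1, ...) to its CPU ticks."""
    result = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                line = f.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        name = line[line.index("(") + 1:line.rindex(")")]
        if name.startswith("kswapd"):
            result[name] = cpu_ticks(line)
    return result
```

Calling kswapd_ticks() twice a few seconds apart and dividing the difference by os.sysconf("SC_CLK_TCK") gives the CPU seconds each kswapd thread consumed in that interval.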

Comment 2 Hank 2010-11-05 11:08:58 UTC

Help! Eventually it hangs the machine:    


Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] BUG: soft lockup - CPU#3 stuck for 61s! [kswapd0:134]
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd serio_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan]
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] CPU 3
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd serio_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan]
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] 
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Pid: 134, comm: kswapd0 Not tainted 2.6.35.6-43.fc14.x86_64 #1 H8DGU/H8DGU
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] RIP: 0010:[<ffffffff81077e57>]  [<ffffffff81077e57>] raw_local_irq_restore+0xb/0x12
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] RSP: 0018:ffff880214d25e00  EFLAGS: 00000286
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] RAX: ffff880100013e78 RBX: ffff880214d25e00 RCX: 000000000000e055
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] RDX: ffff880214db2e80 RSI: 0000000000000286 RDI: 0000000000000286
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] RBP: ffffffff8100a68e R08: ffff880100013e80 R09: ffffffff81b81f80
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff880214d25e20
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] R13: ffffffff810dff5c R14: ffff880214d25e50 R15: 0000000000000000
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] FS:  00007fe9da235720(0000) GS:ffff880002060000(0000) knlGS:0000000000000000
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] CR2: 000000000386bff0 CR3: 0000000040e22000 CR4: 00000000000006e0
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Process kswapd0 (pid: 134, threadinfo ffff880214d24000, task ffff880214db2e80)
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Stack:
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  ffff880214d25e10 ffffffff81469057 ffff880214d25e50 ffffffff81066656
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] <0> ffff880214d25e50 0000000000000000 ffff880100000000 0000000000000000
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] <0> ffff880214d25e70 ffff880100013e78 ffff880214d25ee0 ffffffff810e037a
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Call Trace:
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81469057>] ? _raw_spin_unlock_irqrestore+0x17/0x19
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81066656>] ? prepare_to_wait+0x6c/0x79
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff810e037a>] ? kswapd+0xc0/0x1ba
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Code: e8 ee 11 3f 00 c9 c3 55 48 89 e5 0f 1f 44 00 00 66 ff 05 8d b3 9a 00 fb 66 66 90 66 66 90 c9 c3 55 48 89 e5 0f 1f 44 00 00 57 9d <66> 66 90 66 90 c9 c3 55 48 89 e5 0f 1f 44 00 00 fa 66 66 90 66
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] Call Trace:
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81469057>] ? _raw_spin_unlock_irqrestore+0x17/0x19
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81066656>] ? prepare_to_wait+0x6c/0x79
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff810e037a>] ? kswapd+0xc0/0x1ba
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] BUG: soft lockup - CPU#3 stuck for 61s! [kswapd0:134]
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd serio_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan]
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] CPU 3
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd serio_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan]
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] 
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Pid: 134, comm: kswapd0 Not tainted 2.6.35.6-43.fc14.x86_64 #1 H8DGU/H8DGU
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] RIP: 0010:[<ffffffff810d70d1>]  [<ffffffff810d70d1>] zone_watermark_ok+0xb1/0xba
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] RSP: 0018:ffff880214d25e00  EFLAGS: 00000283
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] RAX: 0000000000000000 RBX: ffff880214d25e20 RCX: 0000000000000697
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] RDX: 00000000000006b4 RSI: 0000000000000286 RDI: 0000000000000000
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffff880214d25e50
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff880214d25df8
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] R13: ffffffff810dff5c R14: ffff880214d25e50 R15: 0000000000000000
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] FS:  00007fe9da235720(0000) GS:ffff880002060000(0000) knlGS:0000000000000000
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] CR2: 000000000386bff0 CR3: 0000000040e22000 CR4: 00000000000006e0
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Process kswapd0 (pid: 134, threadinfo ffff880214d24000, task ffff880214db2e80)
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Stack:
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  0000000000000002 ffff880100000000 0000000000000000 ffff880100013e78
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] <0> ffff880214d25e50 ffffffff810dd93b ffff880100000000 ffff880100000000
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] <0> 0000000000000000 ffff880214d25e70 ffff880214d25ee0 ffffffff810e03fc
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Call Trace:
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810dd93b>] ? sleeping_prematurely+0x55/0x76
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810e03fc>] ? kswapd+0x142/0x1ba
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Code: 48 8b 93 c0 00 00 00 49 d1 fe 48 83 c3 58 48 d3 e2 48 29 d0 4c 39 f0 7e 0e ff c1 44 39 e1 7c e0 b8 01 00 00 00 eb 02 31 c0 5e 5f <5b> 41 5c 41 5d 41 5e c9 c3 55 48 89 e5 0f 1f 44 00 00 bf 03 00
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063] Call Trace:
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810dd93b>] ? sleeping_prematurely+0x55/0x76
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810e03fc>] ? kswapd+0x142/0x1ba
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
Nov  5 00:24:12 hesj3-m31 kernel: [1347753.222063]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] BUG: soft lockup - CPU#3 stuck for 61s! [kswapd0:134]
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd serio_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan]
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] CPU 3
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd serio_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan]
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] 
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Pid: 134, comm: kswapd0 Not tainted 2.6.35.6-43.fc14.x86_64 #1 H8DGU/H8DGU
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] RIP: 0010:[<ffffffff812184ea>]  [<ffffffff812184ea>] find_next_bit+0x9a/0x9c
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] RSP: 0018:ffff880214d25cf0  EFLAGS: 00000202
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] RAX: 0000000000000008 RBX: ffff880214d25cf0 RCX: 0000000000000008
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] RDX: 0000000000000008 RSI: 0000000000000100 RDI: 0000000000000100
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffffffff81b81f60
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff880214d25c90
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] R13: ffff880214db2eb8 R14: ffff880214d25d9c R15: ffff880002075570
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] FS:  00007fe9da235720(0000) GS:ffff880002060000(0000) knlGS:0000000000000000
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] CR2: 000000000386bff0 CR3: 0000000040e22000 CR4: 00000000000006e0
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Process kswapd0 (pid: 134, threadinfo ffff880214d24000, task ffff880214db2e80)
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Stack:
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  ffff880214d25d20 ffffffff810e5cf3 0000000000000000 ffff880100000e00
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] <0> 0000000000000000 0000000000000000 ffff880214d25d60 ffffffff810d7049
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] <0> 0000000000000000 ffff880200000000 ffff880100000000 000000000000000c
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Call Trace:
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Code: 48 85 ff 4c 89 c0 74 23 49 8b 01 b9 40 00 00 00 48 83 ca ff 29 f9 48 d3 ea 48 21 d0 75 06 49 8d 04 38 eb 07 48 0f bc c0 4c 01 c0 <c9> c3 55 48 39 f2 48 89 f0 48 89 e5 0f 83 94 00 00 00 48 89 d1
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063] Call Trace:
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff81065f29>] ? kthread+0x7f/0x87
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff81065eaa>] ? kthread+0x0/0x87
Nov  5 00:25:18 hesj3-m31 kernel: [1347818.720063]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10


I rebooted with " acpi=off noapic nolapic " to try to keep the machine running so I could move 1 TB of small files to/from it.

# cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 9
model name	: AMD Opteron(tm) Processor 6172
stepping	: 1
cpu MHz		: 2100.115
cache size	: 512 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 16
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save
bogomips	: 4200.22
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Is there a way to hot-add and hot-remove more cores?

Thanks.

Comment 3 Luke Hutchison 2010-11-09 04:36:34 UTC

Created attachment 458967
dmesg output for the "CPU stuck" error.

Same problem on a 12-way Xeon system.  The system was out of RAM and swapping out to disk, and after compiling some code directly on an AFS share, this error occurred multiple times (but the timeouts happened in different functions, see the attachment cpustuck.txt).
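
Since the lockups are reported in different functions, it can help to tally them per task and per function at RIP. A rough Python sketch, assuming the "BUG: soft lockup" and "RIP:" line formats seen in the logs in this report; the name summarize_lockups and the regular expressions are illustrative, not an existing tool:

```python
import re
from collections import Counter

# "BUG: soft lockup - CPU#3 stuck for 61s! [kswapd0:134]"
LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")
# "RIP: 0010:[<addr>]  [<addr>] find_next_bit+0x9a/0x9c"
RIP = re.compile(r"RIP: \S+\s+\[<[0-9a-f]+>\]\s+([\w.]+)\+0x")


def summarize_lockups(lines):
    """Count soft lockup reports per (task name, function at RIP)."""
    counts = Counter()
    task = None
    for line in lines:
        m = LOCKUP.search(line)
        if m:
            task = m.group(3)  # remember the task until its RIP line appears
            continue
        m = RIP.search(line)
        if m and task is not None:
            counts[(task, m.group(1))] += 1
            task = None
    return counts


# Lines copied from the logs in this report.
sample = [
    "Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] BUG: soft lockup - CPU#3 stuck for 61s! [kswapd0:134]",
    "Nov  5 00:23:07 hesj3-m31 kernel: [1347687.723063] RIP: 0010:[<ffffffff81077e57>]  [<ffffffff81077e57>] raw_local_irq_restore+0xb/0x12",
    "[ 2536.178964] BUG: soft lockup - CPU#4 stuck for 61s! [kswapd1:73]",
    "[ 2536.179354] RIP: 0010:[<ffffffff8121852f>]  [<ffffffff8121852f>] find_next_bit+0x93/0x9c",
]
for (task, func), n in sorted(summarize_lockups(sample).items()):
    print(task, func, n)
```

The same function can be fed an open log file, for example the dmesg attachment, because it only iterates over lines.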

Comment 4 Luke Hutchison 2010-11-09 04:38:12 UTC

My /proc/cpuinfo for one core, in case it's useful:


processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           X5680  @ 3.33GHz
stepping        : 2
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat tpr_shadow vnmi flexpriority ept vpid
bogomips        : 6649.76
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Do you need lspci output or anything else?

Comment 5 Luke Hutchison 2010-11-09 05:13:16 UTC

This may or may not be important, but my swap partition is actually in a logical volume on top of a hardware RAID array.

(This error seems to happen when the system is swapping?)

Comment 6 Luke Hutchison 2010-11-09 05:41:26 UTC

This appears to have been fixed upstream:

http://kerneltrap.org/mailarchive/linux-kernel/2010/10/27/4637977

Can this fix please be pushed out to F14?

Comment 7 Luke Hutchison 2010-11-19 03:31:28 UTC

This bug affects both Fedora 14 and Ubuntu 10.10.  It is causing major problems on my production systems when they are under memory stress.  By chance, have any Red Hat engineers noticed this bug report?  Thanks.

Comment 8 Kyle McMartin 2010-11-25 02:56:39 UTC

The patch isn't upstream yet...

Comment 9 Kyle McMartin 2010-11-25 03:41:09 UTC

http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmstat-use-a-single-setter-function-and-callback-for-adjusting-percpu-thresholds-fix-set_pgdat_percpu_threshold-dont-use-for_each_online_cpu.patch

http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmstat-use-a-single-setter-function-and-callback-for-adjusting-percpu-thresholds.patch

Comment 10 Kyle McMartin 2010-11-25 04:03:31 UTC

http://kyle.fedorapeople.org/kernel/2.6.35.9-62.bz649694.fc14/

Try this? No promises it doesn't eat your cat or set your mother's hair on fire.

Comment 11 Luke Hutchison 2010-11-25 04:21:33 UTC

Looks like there might be a third patch too that removes a spurious warning?

http://kerneltrap.org/mailarchive/linux-kernel/2010/11/14/4645187

Comment 12 Kyle McMartin 2010-11-25 04:28:02 UTC

It's in there as well.

Comment 13 Kyle McMartin 2010-11-26 17:06:30 UTC

http://kyle.fedorapeople.org/mel-mmotm-kswapd-fixes-2.6.35/
(so I don't accidentally lose them...)

Comment 14 Luke Hutchison 2010-11-28 00:36:55 UTC

Thanks Kyle.  Your kernel seems to solve the problem for me: no error messages about pegged cores when my 24-core Xeon machine with 96GB of RAM starts swapping.  Will do some more testing (although I can't test extensively on my servers because I need to get back to a kernel that works with OpenAFS so my users don't complain).

Comment 15 Kyle McMartin 2010-11-29 19:54:34 UTC

Anyone else able to test that? I don't feel entirely comfortable putting it into F14 without a few more confirmations...

Comment 16 Luke Hutchison 2010-11-29 23:03:09 UTC

The description of the patch by its author sounded like it was a bit of a stop-gap?  Is there a better long-term solution?

Comment 17 Kyle McMartin 2010-11-30 01:39:22 UTC

No, Mel told me that he didn't really expect it to solve this problem though, and was a bit surprised when I told him it had...

Comment 18 Luke Hutchison 2010-11-30 01:49:42 UTC

Er, what *is* the right solution then? High-core-count machines like the ones I administer are going to become very common very soon... Sounds like some part of the scheduler or the VM needs a bottom-up redesign?

Comment 19 Kyle McMartin 2010-12-02 14:27:36 UTC

It just goes to show that these things are not black and white. ;-)

I'm sitting on the patch for a bit. I think I'll put it into rawhide first to make sure it doesn't turn up anything nasty, since it's not as if the problem is fixed all the way through to Linus' git head.

Comment 20 Kyle McMartin 2010-12-02 18:34:48 UTC

Ah, it's probably safe enough. Committed to rawhide and F-14. Will be in the next builds.

Comment 21 Luke Hutchison 2010-12-02 18:43:43 UTC

Great, thanks Kyle!

Comment 22 Fedora Update System 2010-12-03 15:37:50 UTC

kernel-2.6.35.9-64.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.9-64.fc14

Comment 23 Fedora Update System 2010-12-05 00:41:59 UTC

kernel-2.6.35.9-64.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
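
To check whether a machine is already running a kernel at or after this update, the release string can be compared with 2.6.35.9-64.fc14. The sketch below is a simplification: it compares only the numeric version and the leading number of the build field, it is not a full RPM version comparison, and the function names are illustrative. A test build such as 2.6.35.9-62.bz649694.fc14 would compare as older even though it carries the patches.

```python
import re

FIXED = "2.6.35.9-64.fc14"  # update named in the comment above


def release_key(release):
    """Split a Fedora kernel release string into comparable integer parts."""
    version, _, build = release.partition("-")
    build_num = re.match(r"\d+", build)
    return (
        tuple(int(part) for part in version.split(".")),
        int(build_num.group()) if build_num else 0,
    )


def has_update(running, fixed=FIXED):
    """True if `running` is the same as or newer than the fixed release."""
    return release_key(running) >= release_key(fixed)


print(has_update("2.6.35.6-48.fc14.x86_64"))  # False: the originally reported kernel
```

On a live system the running release can be taken from platform.release(). A True result only says the update is installed; it does not rule out other causes of soft lockups.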

Comment 24 freemarket 2011-12-22 05:05:42 UTC

Still see this on FC14, kernel:  2.6.35.14-106.fc14.x86_64:

I know FC14 is EOL, but several Autodesk products don't do FC15 yet.

Any workarounds?
TIA,
Henry

Dec 21 23:53:34 zurich kernel: [ 1509.686601]  [<ffffffff81115f4b>] ? filp_close+0x66/0x70
Dec 21 23:53:34 zurich kernel: [ 1509.686604]  [<ffffffff81050ec2>] ? put_files_struct+0x6e/0xd5
Dec 21 23:53:34 zurich kernel: [ 1509.686606]  [<ffffffff81050fba>] ? exit_files+0x41/0x46
Dec 21 23:53:34 zurich kernel: [ 1509.686608]  [<ffffffff81051504>] ? do_exit+0x295/0x74f
Dec 21 23:53:34 zurich kernel: [ 1509.686610]  [<ffffffff81051c38>] ? do_group_exit+0x7a/0xa2
Dec 21 23:53:34 zurich kernel: [ 1509.686612]  [<ffffffff8105e751>] ? get_signal_to_deliver+0x372/0x398
Dec 21 23:53:34 zurich kernel: [ 1509.686615]  [<ffffffff81008fc9>] ? do_signal+0x72/0x690
Dec 21 23:53:34 zurich kernel: [ 1509.686617]  [<ffffffff81032221>] ? is_prefetch.clone.13+0xd5/0x1d7
Dec 21 23:53:34 zurich kernel: [ 1509.686619]  [<ffffffff8105d515>] ? send_signal+0x60/0x69
Dec 21 23:53:34 zurich kernel: [ 1509.686621]  [<ffffffff8146b667>] ? _raw_spin_unlock_irqrestore+0x17/0x19
Dec 21 23:53:34 zurich kernel: [ 1509.686623]  [<ffffffff8105d637>] ? force_sig_info+0xdc/0xee
Dec 21 23:53:34 zurich kernel: [ 1509.686625]  [<ffffffff81009628>] ? do_notify_resume+0x28/0x86
Dec 21 23:53:34 zurich kernel: [ 1509.686628]  [<ffffffff8146bb9c>] ? retint_signal+0x48/0x8c
Dec 21 23:54:30 zurich kernel: [ 1565.516203] BUG: soft lockup - CPU#6 stuck for 61s! [ksoftirqd/6:22]
Dec 21 23:54:30 zurich kernel: [ 1565.516205] Modules linked in: tcp_lp fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput snd_hda_codec_nvhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nvidia(P) snd_seq snd_seq_device nouveau ttm drm_kms_helper drm snd_pcm xhci_hcd i2c_algo_bit video output snd_timer snd shpchp i2c_i801 r8169 i2c_core mii iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support i7core_edac edac_core wmi serio_raw microcode ata_generic pata_acpi [last unloaded: scsi_wait_scan]
