Description of problem:
I get "BUG: soft lockup" messages in the logs, and SSH connections hang for a long time.

Version-Release number of selected component (if applicable):
2.6.35.6-48.fc14.x86_64

How reproducible:
I have no idea how to reproduce it.

Steps to Reproduce:
1.
2.
3.

Actual results:
The system gets stuck somewhere in the kernel.

Expected results:
No hangs.

Additional info:
free:
             total       used       free     shared    buffers     cached
Mem:      16522288   16411324     110964          0  *5111400*    *48728*
-/+ buffers/cache:   11251196    5271092

BUG: soft lockup - CPU#2 stuck for 61s! [kswapd0:128]
Modules linked in: cpufreq_stats freq_table ipt_MASQUERADE iptable_nat nf_nat sit tunnel4 tun ebtable_nat ebtables bridge stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 vfat fat kvm_intel kvm bnx2 i7core_edac ioatdma edac_core cdc_ether matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc usbnet shpchp i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support dca joydev serio_raw mii usb_storage megaraid
CPU 2
Modules linked in: cpufreq_stats freq_table ipt_MASQUERADE iptable_nat nf_nat sit tunnel4 tun ebtable_nat ebtables bridge stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 vfat fat kvm_intel kvm bnx2 i7core_edac ioatdma edac_core cdc_ether matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc usbnet shpchp i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support dca joydev serio_raw mii usb_storage megaraid
Pid: 128, comm: kswapd0 Not tainted 2.6.35.6-48.fc14.x86_64 #1 69Y4438 /System x3650 M3 -[7945K3G]-
RIP: 0010:[<ffffffff810e5c8d>] [<ffffffff810e5c8d>] zone_nr_free_pages+0x19/0x98
RSP: 0018:ffff8802751e1d00 EFLAGS: 00000286
RAX: 0000000000000000 RBX: ffff8802751e1d20 RCX: 0000000000000000
RDX: 0000000000000f8b RSI: 0000000000000000 RDI: ffff880100000000
RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffffffff81b81f80
R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff8802751e1cf0
R13: ffffffff8100a68e R14: ffff8802751e1cb0 R15: 0000000000000060
FS: 0000000000000000(0000) GS:ffff880002040000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa6d6414c40 CR3: 0000000001a42000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 128, threadinfo ffff8802751e0000, task ffff880276972e80)
Stack:
 0000000000000000 ffff880100000000 0000000000000000 0000000000000000
<0> ffff8802751e1d60 ffffffff810d7049 0000000000000000 ffff880200000000
<0> ffff880100000000 000000000000000c 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
 [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
 [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
 [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
 [<ffffffff81065f29>] ? kthread+0x7f/0x87
 [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81065eaa>] ? kthread+0x0/0x87
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Code: 40 0f a3 3a 19 ff 85 ff 74 e3 48 8b 10 48 89 11 c9 c3 55 48 89 e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 48 8b 97 30 05 00 00 <31> c0 48 89 fb 48 85 d2 48 0f 49 c2 48 3b 47 18 73 65 48 8b 97
Call Trace:
 [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
 [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
 [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
 [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
 [<ffffffff81065f29>] ? kthread+0x7f/0x87
 [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81065eaa>] ? kthread+0x0/0x87
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
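For anyone collecting these reports from a log file, the header line of each soft lockup can be pulled out mechanically. The following is only a minimal sketch: the regular expression is written against the message format shown in this report, and the names `LOCKUP_RE` and `parse_lockups` are made up for illustration.

```python
import re

# Pattern written against the header line seen in this report, e.g.
#   BUG: soft lockup - CPU#2 stuck for 61s! [kswapd0:128]
# (hypothetical helper, not part of any existing tool)
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_lockups(text):
    """Return one dict per soft-lockup header found in the given log text."""
    return [
        {
            "cpu": int(m["cpu"]),
            "secs": int(m["secs"]),
            "comm": m["comm"],
            "pid": int(m["pid"]),
        }
        for m in LOCKUP_RE.finditer(text)
    ]


if __name__ == "__main__":
    line = "BUG: soft lockup - CPU#2 stuck for 61s! [kswapd0:128]"
    print(parse_lockups(line))
```

The header must be on a single line for the pattern to match, so a log that was re-wrapped when pasted needs to be unwrapped first.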
I'm having the same issue on 2.6.35.6-45.fc14.x86_64. I can reproduce it consistently by importing a large MySQL dump.

# free -k
             total       used       free     shared    buffers     cached
Mem:       4114904    4090508      24396          0      57732    2577020
-/+ buffers/cache:    1455756    2659148
Swap:            0          0          0

MySQL's InnoDB buffer is sized at 1G. During the import, mysqld uses about 30% of memory. Swap is disabled here, but the same thing happens with 4G of swap. As the import progresses, kswapd takes progressively more CPU time, and eventually these start showing up:

[ 2536.178964] BUG: soft lockup - CPU#4 stuck for 61s! [kswapd1:73]
[ 2536.179329] Modules linked in: ipv6 i7core_edac edac_core iTCO_wdt iTCO_vendor_support tg3 serio_raw hed raid0 raid1 [last unloaded: scsi_wait_scan]
[ 2536.179339] CPU 4
[ 2536.179340] Modules linked in: ipv6 i7core_edac edac_core iTCO_wdt iTCO_vendor_support tg3 serio_raw hed raid0 raid1 [last unloaded: scsi_wait_scan]
[ 2536.179348]
[ 2536.179352] Pid: 73, comm: kswapd1 Not tainted 2.6.35.6-45.fc14.x86_64 #1 /ProLiant ML150 G6
[ 2536.179354] RIP: 0010:[<ffffffff8121852f>] [<ffffffff8121852f>] find_next_bit+0x93/0x9c
[ 2536.179366] RSP: 0018:ffff88007bd6fcf0 EFLAGS: 00000206
[ 2536.179368] RAX: 00000000000000fc RBX: ffff88007bd6fcf0 RCX: 0000000000000002
[ 2536.179371] RDX: 0000000000000002 RSI: 0000000000000100 RDI: 0000000000000100
[ 2536.179373] RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffffffff81b81f60
[ 2536.179375] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff88007bd6fcb0
[ 2536.179377] R13: 00000000000000a0 R14: ffff88007bd6fc70 R15: ffff88007bd6fc78
[ 2536.179380] FS: 0000000000000000(0000) GS:ffff880102000000(0000) knlGS:0000000000000000
[ 2536.179383] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2536.179385] CR2: 00007f4682116026 CR3: 0000000008140000 CR4: 00000000000006e0
[ 2536.179388] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2536.179390] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2536.179393] Process kswapd1 (pid: 73, threadinfo ffff88007bd6e000, task ffff88007bd71740)
[ 2536.179395] Stack:
[ 2536.179730] ffff88007bd6fd20 ffffffff810e5cf3 0000000000000000 ffff880100000e00
[ 2536.179734] <0> 0000000000000000 0000000000000000 ffff88007bd6fd60 ffffffff810d7049
[ 2536.179737] <0> ffffffffffffff10 ffffffff00000000 ffff880100000000 000000000000000c
[ 2536.179741] Call Trace:
[ 2536.180148] [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98
[ 2536.180156] [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
[ 2536.180160] [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
[ 2536.180164] [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
[ 2536.180170] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
[ 2536.180173] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
[ 2536.180176] [<ffffffff81065f29>] ? kthread+0x7f/0x87
[ 2536.180182] [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
[ 2536.180185] [<ffffffff81065eaa>] ? kthread+0x0/0x87
[ 2536.180188] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
[ 2536.180189] Code: c7 c0 ff ff ff 75 e3 48 85 ff 4c 89 c0 74 23 49 8b 01 b9 40 00 00 00 48 83 ca ff 29 f9 48 d3 ea 48 21 d0 75 06 49 8d 04 38 eb 07 <48> 0f bc c0 4c 01 c0 c9 c3 55 48 39 f2 48 89 f0 48 89 e5 0f 83
[ 2536.181502] Call Trace:
[ 2536.181505] [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98
[ 2536.181508] [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba
[ 2536.181512] [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8
[ 2536.181515] [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba
[ 2536.181518] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39
[ 2536.181522] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba
[ 2536.181524] [<ffffffff81065f29>] ? kthread+0x7f/0x87
[ 2536.181527] [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10
[ 2536.181530] [<ffffffff81065eaa>] ? kthread+0x0/0x87
[ 2536.181533] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
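As a side note on reading the `free -k` output in this comment: the "-/+ buffers/cache" row is derived from the "Mem:" row by moving buffers and page cache from "used" to "free". The sketch below reproduces that arithmetic with the numbers quoted above; the function name `buffers_cache_row` is made up for illustration.

```python
def buffers_cache_row(used, free, buffers, cached):
    """Recompute the '-/+ buffers/cache' row of the older procps `free` layout.

    All values are in KiB, as printed by `free -k`.
    """
    used_by_apps = used - buffers - cached   # memory applications actually hold
    free_for_apps = free + buffers + cached  # memory the kernel could reclaim
    return used_by_apps, free_for_apps


if __name__ == "__main__":
    # Values from the `free -k` output in this comment.
    print(buffers_cache_row(used=4090508, free=24396,
                            buffers=57732, cached=2577020))
    # -> (1455756, 2659148)
```

So although only about 24 MB is reported as free, roughly 2.5 GB of the 4 GB is buffers and page cache, which is the memory kswapd is expected to reclaim during the import.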
Help! Eventually it hangs the machine: Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] BUG: soft lockup - CPU#3 stuck for 61s! [kswapd0:134] Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd ser io_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan] Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] CPU 3 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd ser io_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan] Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Pid: 134, comm: kswapd0 Not tainted 2.6.35.6-43.fc14.x86_64 #1 H8DGU/H8DGU Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] RIP: 0010:[<ffffffff81077e57>] [<ffffffff81077e57>] raw_local_irq_restore+0xb/0x12 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] RSP: 0018:ffff880214d25e00 EFLAGS: 00000286 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] RAX: ffff880100013e78 RBX: ffff880214d25e00 RCX: 000000000000e055 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] RDX: ffff880214db2e80 RSI: 0000000000000286 RDI: 0000000000000286 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] RBP: ffffffff8100a68e R08: ffff880100013e80 R09: ffffffff81b81f80 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff880214d25e20 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] R13: 
ffffffff810dff5c R14: ffff880214d25e50 R15: 0000000000000000 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] FS: 00007fe9da235720(0000) GS:ffff880002060000(0000) knlGS:0000000000000000 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] CR2: 000000000386bff0 CR3: 0000000040e22000 CR4: 00000000000006e0 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Process kswapd0 (pid: 134, threadinfo ffff880214d24000, task ffff880214db2e80) Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Stack: Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] ffff880214d25e10 ffffffff81469057 ffff880214d25e50 ffffffff81066656 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] <0> ffff880214d25e50 0000000000000000 ffff880100000000 0000000000000000 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] <0> ffff880214d25e70 ffff880100013e78 ffff880214d25ee0 ffffffff810e037a Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Call Trace: Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81469057>] ? _raw_spin_unlock_irqrestore+0x17/0x19 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81066656>] ? prepare_to_wait+0x6c/0x79 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff810e037a>] ? kswapd+0xc0/0x1ba Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81065f29>] ? kthread+0x7f/0x87 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff8100aae4>] ? 
kernel_thread_helper+0x4/0x10 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81065eaa>] ? kthread+0x0/0x87 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Code: e8 ee 11 3f 00 c9 c3 55 48 89 e5 0f 1f 44 00 00 66 ff 05 8d b3 9a 00 fb 66 66 90 66 66 90 c9 c3 55 48 89 e5 0f 1f 44 00 00 57 9d <66> 66 90 66 90 c9 c3 55 48 89 e5 0f 1f 44 00 00 fa 66 66 90 66 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] Call Trace: Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81469057>] ? _raw_spin_unlock_irqrestore+0x17/0x19 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81066656>] ? prepare_to_wait+0x6c/0x79 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff810e037a>] ? kswapd+0xc0/0x1ba Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81065f29>] ? kthread+0x7f/0x87 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff81065eaa>] ? kthread+0x0/0x87 Nov 5 00:23:07 hesj3-m31 kernel: [1347687.723063] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] BUG: soft lockup - CPU#3 stuck for 61s! 
[kswapd0:134] Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd ser io_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan] Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] CPU 3 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd ser io_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan] Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Pid: 134, comm: kswapd0 Not tainted 2.6.35.6-43.fc14.x86_64 #1 H8DGU/H8DGU Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] RIP: 0010:[<ffffffff810d70d1>] [<ffffffff810d70d1>] zone_watermark_ok+0xb1/0xba Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] RSP: 0018:ffff880214d25e00 EFLAGS: 00000283 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] RAX: 0000000000000000 RBX: ffff880214d25e20 RCX: 0000000000000697 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] RDX: 00000000000006b4 RSI: 0000000000000286 RDI: 0000000000000000 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffff880214d25e50 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff880214d25df8 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] R13: ffffffff810dff5c R14: ffff880214d25e50 R15: 0000000000000000 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] FS: 00007fe9da235720(0000) 
GS:ffff880002060000(0000) knlGS:0000000000000000 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] CR2: 000000000386bff0 CR3: 0000000040e22000 CR4: 00000000000006e0 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Process kswapd0 (pid: 134, threadinfo ffff880214d24000, task ffff880214db2e80) Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Stack: Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] 0000000000000002 ffff880100000000 0000000000000000 ffff880100013e78 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] <0> ffff880214d25e50 ffffffff810dd93b ffff880100000000 ffff880100000000 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] <0> 0000000000000000 ffff880214d25e70 ffff880214d25ee0 ffffffff810e03fc Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Call Trace: Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810dd93b>] ? sleeping_prematurely+0x55/0x76 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810e03fc>] ? kswapd+0x142/0x1ba Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff81065f29>] ? kthread+0x7f/0x87 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff81065eaa>] ? kthread+0x0/0x87 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff8100aae0>] ? 
kernel_thread_helper+0x0/0x10 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Code: 48 8b 93 c0 00 00 00 49 d1 fe 48 83 c3 58 48 d3 e2 48 29 d0 4c 39 f0 7e 0e ff c1 44 39 e1 7c e0 b8 01 00 00 00 eb 02 31 c0 5e 5f <5b> 41 5c 41 5d 41 5e c9 c3 55 48 89 e5 0f 1f 44 00 00 bf 03 00 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] Call Trace: Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810dd93b>] ? sleeping_prematurely+0x55/0x76 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810e03fc>] ? kswapd+0x142/0x1ba Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff81065f29>] ? kthread+0x7f/0x87 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff81065eaa>] ? kthread+0x0/0x87 Nov 5 00:24:12 hesj3-m31 kernel: [1347753.222063] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] BUG: soft lockup - CPU#3 stuck for 61s! 
[kswapd0:134] Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd ser io_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan] Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] CPU 3 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Modules linked in: ip_vs libcrc32c nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 i2c_piix4 amd64_edac_mod i2c_core igb edac_core edac_mce_amd ser io_raw k10temp dca microcode jfs raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 pata_acpi ata_generic usb_storage pata_atiixp mpt2sas scsi_transport_sas raid_class [last unloaded: scsi_wait_scan] Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Pid: 134, comm: kswapd0 Not tainted 2.6.35.6-43.fc14.x86_64 #1 H8DGU/H8DGU Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] RIP: 0010:[<ffffffff812184ea>] [<ffffffff812184ea>] find_next_bit+0x9a/0x9c Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] RSP: 0018:ffff880214d25cf0 EFLAGS: 00000202 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] RAX: 0000000000000008 RBX: ffff880214d25cf0 RCX: 0000000000000008 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] RDX: 0000000000000008 RSI: 0000000000000100 RDI: 0000000000000100 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] RBP: ffffffff8100a68e R08: 0000000000000000 R09: ffffffff81b81f60 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] R10: 0000000000000000 R11: ffffffff81b81f60 R12: ffff880214d25c90 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] R13: ffff880214db2eb8 R14: ffff880214d25d9c R15: ffff880002075570 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] FS: 00007fe9da235720(0000) 
GS:ffff880002060000(0000) knlGS:0000000000000000 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] CR2: 000000000386bff0 CR3: 0000000040e22000 CR4: 00000000000006e0 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Process kswapd0 (pid: 134, threadinfo ffff880214d24000, task ffff880214db2e80) Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Stack: Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] ffff880214d25d20 ffffffff810e5cf3 0000000000000000 ffff880100000e00 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] <0> 0000000000000000 0000000000000000 ffff880214d25d60 ffffffff810d7049 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] <0> 0000000000000000 ffff880200000000 ffff880100000000 000000000000000c Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Call Trace: Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff81065f29>] ? kthread+0x7f/0x87 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff81065eaa>] ? 
kthread+0x0/0x87 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Code: 48 85 ff 4c 89 c0 74 23 49 8b 01 b9 40 00 00 00 48 83 ca ff 29 f9 48 d3 ea 48 21 d0 75 06 49 8d 04 38 eb 07 48 0f bc c0 4c 01 c0 <c9> c3 55 48 39 f2 48 89 f0 48 89 e5 0f 83 94 00 00 00 48 89 d1 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] Call Trace: Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810e5cf3>] ? zone_nr_free_pages+0x7f/0x98 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810d7049>] ? zone_watermark_ok+0x29/0xba Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810dff5c>] ? balance_pgdat+0x16a/0x4c8 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810e045e>] ? kswapd+0x1a4/0x1ba Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810663c3>] ? autoremove_wake_function+0x0/0x39 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff810e02ba>] ? kswapd+0x0/0x1ba Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff81065f29>] ? kthread+0x7f/0x87 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff8100aae4>] ? kernel_thread_helper+0x4/0x10 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff81065eaa>] ? kthread+0x0/0x87 Nov 5 00:25:18 hesj3-m31 kernel: [1347818.720063] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 I rebooted with " acpi=off noapic nolapic " to try to keep the machine running so I could move 1 TB of small files to/from it. 
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 9
model name : AMD Opteron(tm) Processor 6172
stepping : 1
cpu MHz : 2100.115
cache size : 512 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 16
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save
bogomips : 4200.22
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Is there a way to hot-add and hot-remove more cores? Thanks.
Created attachment 458967
dmesg output for the "CPU stuck" error.

Same problem on a 12-way Xeon system. The system was out of RAM and swapping out to disk, and after I compiled some code directly on an AFS share, this error occurred multiple times. The timeouts happened in different functions; see the attachment cpustuck.txt.
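To see at a glance which functions the lockups were caught in, the symbol on each `RIP:` line can be tallied. This is a rough sketch only: the pattern is written against the `RIP:` lines pasted earlier in this report (the attachment itself is not reproduced here), and `RIP_RE` and `rip_symbols` are hypothetical names.

```python
import re
from collections import Counter

# Pattern written against RIP lines as they appear in this report, e.g.
#   RIP: 0010:[<ffffffff810e5c8d>] [<ffffffff810e5c8d>] zone_nr_free_pages+0x19/0x98
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:\[<[0-9a-f]+>\]\s+\[<[0-9a-f]+>\]\s+(?P<sym>[\w.]+)\+0x"
)


def rip_symbols(text):
    """Count how often each function appears as the RIP of a lockup report."""
    return Counter(m["sym"] for m in RIP_RE.finditer(text))


if __name__ == "__main__":
    # RIP lines copied from the reports earlier in this bug.
    sample = "\n".join([
        "RIP: 0010:[<ffffffff810e5c8d>] [<ffffffff810e5c8d>] zone_nr_free_pages+0x19/0x98",
        "RIP: 0010:[<ffffffff8121852f>] [<ffffffff8121852f>] find_next_bit+0x93/0x9c",
        "RIP: 0010:[<ffffffff810d70d1>] [<ffffffff810d70d1>] zone_watermark_ok+0xb1/0xba",
        "RIP: 0010:[<ffffffff812184ea>] [<ffffffff812184ea>] find_next_bit+0x9a/0x9c",
    ])
    for sym, count in rip_symbols(sample).most_common():
        print(count, sym)
```

A spread of different symbols under the same caller, as in the traces above, suggests the CPU is spinning in a loop higher up the call chain rather than being stuck inside any one of these functions.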
My /proc/cpuinfo for one core, in case it's useful:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
stepping : 2
cpu MHz : 1600.000
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat tpr_shadow vnmi flexpriority ept vpid
bogomips : 6649.76
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

Do you need lspci output or anything else?
This may or may not be important, but my swap partition is actually in a logical volume on top of a hardware RAID array. (This error seems to happen when the system is swapping?)
This appears to have been fixed upstream: http://kerneltrap.org/mailarchive/linux-kernel/2010/10/27/4637977 Can this fix please be pushed out to F14?
This bug affects both Fedora 14 and Ubuntu 10.10. It is causing major problems on my production systems when they are under memory stress. Have any Red Hat engineers noticed this bug report, by chance? Thanks.
The patch isn't upstream yet...
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmstat-use-a-single-setter-function-and-callback-for-adjusting-percpu-thresholds-fix-set_pgdat_percpu_threshold-dont-use-for_each_online_cpu.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmstat-use-a-single-setter-function-and-callback-for-adjusting-percpu-thresholds.patch
http://kyle.fedorapeople.org/kernel/2.6.35.9-62.bz649694.fc14/ Try this? No promises it doesn't eat your cat or set your mother's hair on fire.
Looks like there might be a third patch too that removes a spurious warning? http://kerneltrap.org/mailarchive/linux-kernel/2010/11/14/4645187
It's in there as well.
http://kyle.fedorapeople.org/mel-mmotm-kswapd-fixes-2.6.35/ (so I don't accidentally lose them...)
Thanks Kyle. Your kernel seems to solve the problem for me: no error messages about pegged cores when my 24-core Xeon machine with 96GB of RAM starts swapping. Will do some more testing (although I can't test extensively on my servers because I need to get back to a kernel that works with OpenAFS so my users don't complain).
Anyone else able to test that? I don't feel entirely comfortable putting it into F14 without a few more confirmations...
The description of the patch by its author sounded like it was a bit of a stop-gap? Is there a better long-term solution?
No, Mel told me that he didn't really expect it to solve this problem though, and was a bit surprised when I told him it had...
Er, what *is* the right solution then? High-core-count machines like the ones I administer are going to become very common very soon... Sounds like some part of the scheduler or the VM needs a bottom-up redesign?
It just goes to show that these things are not black and white. ;-) I'm sitting on the patch for a bit, I think I'll put it into rawhide to make sure it doesn't turn up anything nasty since it's not like the problem is fixed through to Linus' git head.
Ah, it's probably safe enough. Committed to rawhide and F-14. Will be in the next builds.
Great, thanks Kyle!
kernel-2.6.35.9-64.fc14 has been submitted as an update for Fedora 14. https://admin.fedoraproject.org/updates/kernel-2.6.35.9-64.fc14
kernel-2.6.35.9-64.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.
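To check whether a machine is already running a kernel at or beyond this update, the release strings can be compared numerically. This is a simplified sketch, not RPM's real version-comparison algorithm: it assumes release strings of the form seen in this report (dotted version, a dash, then a numeric build number), and `kernel_key` and `has_update` are hypothetical helper names.

```python
import re


def kernel_key(release):
    """Turn '2.6.35.9-64.fc14.x86_64' into the tuple (2, 6, 35, 9, 64).

    Simplification: only the dotted version and the first numeric field
    after the dash are used; the dist tag and architecture are ignored.
    """
    m = re.match(r"(\d+(?:\.\d+)*)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(p) for p in m.group(1).split(".")) + (int(m.group(2)),)


def has_update(running, fixed="2.6.35.9-64.fc14"):
    """True if `running` is at least the update named in this report."""
    return kernel_key(running) >= kernel_key(fixed)


if __name__ == "__main__":
    # Kernel from the original report, which predates the update.
    print(has_update("2.6.35.6-48.fc14.x86_64"))
```

On a live system the running release could be obtained with `platform.release()` from the Python standard library. Running the update only means the patches are present; it does not by itself show that the lockups are gone.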
I still see this on FC14 with kernel 2.6.35.14-106.fc14.x86_64. I know FC14 is EOL, but several Autodesk products don't support FC15 yet. Are there any workarounds?

TIA, Henry

Dec 21 23:53:34 zurich kernel: [ 1509.686601] [<ffffffff81115f4b>] ? filp_close+0x66/0x70
Dec 21 23:53:34 zurich kernel: [ 1509.686604] [<ffffffff81050ec2>] ? put_files_struct+0x6e/0xd5
Dec 21 23:53:34 zurich kernel: [ 1509.686606] [<ffffffff81050fba>] ? exit_files+0x41/0x46
Dec 21 23:53:34 zurich kernel: [ 1509.686608] [<ffffffff81051504>] ? do_exit+0x295/0x74f
Dec 21 23:53:34 zurich kernel: [ 1509.686610] [<ffffffff81051c38>] ? do_group_exit+0x7a/0xa2
Dec 21 23:53:34 zurich kernel: [ 1509.686612] [<ffffffff8105e751>] ? get_signal_to_deliver+0x372/0x398
Dec 21 23:53:34 zurich kernel: [ 1509.686615] [<ffffffff81008fc9>] ? do_signal+0x72/0x690
Dec 21 23:53:34 zurich kernel: [ 1509.686617] [<ffffffff81032221>] ? is_prefetch.clone.13+0xd5/0x1d7
Dec 21 23:53:34 zurich kernel: [ 1509.686619] [<ffffffff8105d515>] ? send_signal+0x60/0x69
Dec 21 23:53:34 zurich kernel: [ 1509.686621] [<ffffffff8146b667>] ? _raw_spin_unlock_irqrestore+0x17/0x19
Dec 21 23:53:34 zurich kernel: [ 1509.686623] [<ffffffff8105d637>] ? force_sig_info+0xdc/0xee
Dec 21 23:53:34 zurich kernel: [ 1509.686625] [<ffffffff81009628>] ? do_notify_resume+0x28/0x86
Dec 21 23:53:34 zurich kernel: [ 1509.686628] [<ffffffff8146bb9c>] ? retint_signal+0x48/0x8c
Dec 21 23:54:30 zurich kernel: [ 1565.516203] BUG: soft lockup - CPU#6 stuck for 61s! [ksoftirqd/6:22]
Dec 21 23:54:30 zurich kernel: [ 1565.516205] Modules linked in: tcp_lp fuse nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput snd_hda_codec_nvhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nvidia(P) snd_seq snd_seq_device nouveau ttm drm_kms_helper drm snd_pcm xhci_hcd i2c_algo_bit video output snd_timer snd shpchp i2c_i801 r8169 i2c_core mii iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support i7core_edac edac_core wmi serio_raw microcode ata_generic pata_acpi [last unloaded: scsi_wait_scan]