Description of problem:
When our distributed I/O load (dd_io) started, I saw the following message on one of four nodes:

BUG: soft lockup - CPU#1 stuck for 10s! [glock_workqueue:3078]
Pid: 3078, comm: glock_workqueue
EIP: 0060:[<c0609a70>] CPU: 1
EIP is at _spin_lock+0x7/0xf
EFLAGS: 00000286 Not tainted (2.6.18-85.003 #1)
EAX: de43d2dc EBX: de43d2c0 ECX: 00000286 EDX: 00000200
ESI: de43d35c EDI: f6e3d740 EBP: 00000286
DS: 007b ES: 007b
CR0: 8005003b CR2: 08066204 CR3: 00726000 CR4: 000006d0
 [<f8d8f3f2>] glock_work_func+0xb/0x31 [gfs2]
 [<c0433524>] run_workqueue+0x78/0xb5
 [<f8d8f3e7>] glock_work_func+0x0/0x31 [gfs2]
 [<c0433dd8>] worker_thread+0xd9/0x10d
 [<c042028f>] default_wake_function+0x0/0xc
 [<c0433cff>] worker_thread+0x0/0x10d
 [<c04361f1>] kthread+0xc0/0xeb
 [<c0436131>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
=======================

Version-Release number of selected component (if applicable):
kernel-2.6.18-85.003 - this kernel has the patch for bug 428751

How reproducible:
unknown

Steps to Reproduce:
1. run dd_io with the above kernel

Additional info:

lock_dlm2     D 2C039F97  2916  3474     11  3484  3168 (L-TLB)
 f5cbcec8 00000046 000006c5 2c039f97 000006c5 f7f9daa0 00000009 f7a7c550
 f7dfd000 2c03a8b9 000006c5 00000922 00000000 f7a7c65c c200c8e0 f67c3c40
 00000000 00000020 00000000 0000000f c07a09b0 c07a09ac c07a09b0 c07a09ac
Call Trace:
 [<c06084ee>] wait_for_completion+0x69/0x8d
 [<c042028f>] default_wake_function+0x0/0xc
 [<c0435fa8>] kthread_stop+0x4e/0x6c
 [<f8c99daa>] gdlm_withdraw+0x9c/0xb2 [lock_dlm]
 [<c04362bd>] autoremove_wake_function+0x0/0x2d
 [<f8d9490e>] gfs2_withdraw_lockproto+0x16/0x51 [gfs2]
 [<f8d91f90>] gfs2_lm_withdraw+0x63/0x7f [gfs2]
 [<f8da2cc5>] gfs2_assert_withdraw_i+0x1e/0x30 [gfs2]
 [<f8d8e74d>] xmote_bh+0x1c2/0x248 [gfs2]
 [<f8d8e850>] gfs2_glock_cb+0x7d/0xf6 [gfs2]
 [<f8c9a65e>] gdlm_thread+0x5b4/0x60a [lock_dlm]
 [<f8c9a6b4>] gdlm_thread2+0x0/0x7 [lock_dlm]
 [<c04361f1>] kthread+0xc0/0xeb
 [<c0436131>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
=======================
gfs2_glockd   D 8ACEFD91  2920  3484     11  3486  3474 (L-TLB)
 f6287efc 00000046 018ef4e4 8acefd91 000006d6 c43bd598 0000000a f7b46aa0
 c06723c0 8acf1b6f 000006d6 00001dde 00000000 f7b46bac c200c8e0 f8d2f5c1
 00000000 c43bd580 c43bd580 fffffffd 00000000 00000000 f7b9362c f7b9362c
Call Trace:
 [<f8d2f5c1>] grant_pending_locks+0x62/0x137 [dlm]
 [<c0609721>] rwsem_down_write_failed+0x126/0x141
 [<f8d308d6>] __put_lkb+0x28/0xd5 [dlm]
 [<c0438c15>] .text.lock.rwsem+0x2b/0x3a
 [<f8d92cd3>] gfs2_log_flush+0x18/0x40c [gfs2]
 [<c0608391>] schedule+0x90d/0x9ba
 [<f8d8f8c8>] inode_go_sync+0x50/0xb8 [gfs2]
 [<f8d8e4a0>] gfs2_glock_drop_th+0x14/0xff [gfs2]
 [<f8d8eacd>] run_queue+0xa6/0x236 [gfs2]
 [<f8d8f000>] gfs2_glmutex_unlock+0x26/0x3c [gfs2]
 [<f8d8f0a3>] gfs2_reclaim_glock+0x8d/0x97 [gfs2]
 [<f8d87457>] gfs2_glockd+0x13/0xce [gfs2]
 [<c04362bd>] autoremove_wake_function+0x0/0x2d
 [<f8d87444>] gfs2_glockd+0x0/0xce [gfs2]
 [<c04361f1>] kthread+0xc0/0xeb
 [<c0436131>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
=======================
gfs2_recoverd S 1E0FB83E  3672  3486     11  3488  3484 (L-TLB)
 f62d8f98 00000046 00000000 1e0fb83e 00000751 f7de1c50 00000007 c2109550
 c06723c0 1e0fc88b 00000751 0000104d 00000000 c210965c c200c8e0 c042dcc0
 c079ee00 f62d8fa0 00000286 fffffffd 00000000 00000000 00775684 00775684
Call Trace:
 [<c042dcc0>] lock_timer_base+0x15/0x2f
 [<f8d87512>] gfs2_recoverd+0x0/0x53 [gfs2]
 [<c0608ad4>] schedule_timeout+0x71/0x8c
 [<c042d3df>] process_timeout+0x0/0x5
 [<f8d87557>] gfs2_recoverd+0x45/0x53 [gfs2]
 [<c04361f1>] kthread+0xc0/0xeb
 [<c0436131>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
=======================
gfs2_logd     D EB8D26DA  2868  3488     11  3489  3486 (L-TLB)
 f6cf9eb0 00000046 f88b878d eb8d26da 000006c5 00000000 0000000a f7ba2000
 c06723c0 eb8dc1b9 000006c5 00009adf 00000000 f7ba210c c200c8e0 c04d996c
 f6056440 c042ce42 f7f7feac fffffffd 00000000 00000000 c200c8e0 00000000
Call Trace:
 [<f88b878d>] dm_request+0xb5/0xd4 [dm_mod]
 [<c04d996c>] generic_unplug_device+0x15/0x22
 [<c042ce42>] getnstimeofday+0x30/0xb6
 [<c0608a31>] io_schedule+0x36/0x59
 [<c0473410>] sync_buffer+0x30/0x33
 [<c0608c08>] __wait_on_bit+0x33/0x58
 [<c04733e0>] sync_buffer+0x0/0x33
 [<c04733e0>] sync_buffer+0x0/0x33
 [<c0608c8f>] out_of_line_wait_on_bit+0x62/0x6a
 [<c04362ea>] wake_bit_function+0x0/0x3c
 [<c047338d>] __wait_on_buffer+0x1c/0x1f
 [<c0473dbd>] sync_dirty_buffer+0x86/0xb8
 [<f8d928d6>] log_write_header+0x132/0x304 [gfs2]
 [<f8d9301e>] gfs2_log_flush+0x363/0x40c [gfs2]
 [<f8d924e0>] gfs2_ail1_empty+0x13/0x7d [gfs2]
 [<f8d875f7>] gfs2_logd+0x92/0x13f [gfs2]
 [<f8d87565>] gfs2_logd+0x0/0x13f [gfs2]
 [<c04361f1>] kthread+0xc0/0xeb
 [<c0436131>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
=======================
gfs2_quotad   S F4749BE3  2852  3489     11  8311  3488 (L-TLB)
 f6324f98 00000046 f7fff680 f4749be3 00000757 f6324f84 0000000a f7bc5aa0
 c06723c0 f474af1d 00000757 0000133a 00000000 f7bc5bac c200c8e0 c042dcc0
 c079ee00 f6324fa0 00000286 fffffffd 00000000 00000000 0076f29e 0076f29e
Call Trace:
 [<c042dcc0>] lock_timer_base+0x15/0x2f
 [<f8d876a4>] gfs2_quotad+0x0/0x12c [gfs2]
 [<c0608ad4>] schedule_timeout+0x71/0x8c
 [<c042d3df>] process_timeout+0x0/0x5
 [<f8d877be>] gfs2_quotad+0x11a/0x12c [gfs2]
 [<c04361f1>] kthread+0xc0/0xeb
 [<c0436131>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
=======================
pdflush       S A1E85F24  2532  8311     11  3489 (L-TLB)
 d062afa0 00000046 c04d9a79 a1e85f24 000006a4 c04362bd 0000000a ce3b7550
 c20ef550 a1e87696 000006a4 00001772 00000001 ce3b765c c20136c4 fffffff4
 0000040c 00000020 00000001 00000000 00000000 00000021 00000001 d062afb8
Call Trace:
 [<c04d9a79>] blk_congestion_wait+0x5e/0x67
 [<c04362bd>] autoremove_wake_function+0x0/0x2d
 [<c045acdd>] pdflush+0x0/0x1a3
 [<c045ad94>] pdflush+0xb7/0x1a3
 [<c04361f1>] kthread+0xc0/0xeb
 [<c0436131>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
=======================
d_doio        D EBDF4409  2376  8545      1  3180 (NOTLB)
 c52a5cac 00000082 00027d9f ebdf4409 000006c4 c52a5ca0 00000007 c210e000
 c20ef550 ed3dce39 000006c4 015e8a30 00000001 c210e10c c20136c4 c52a5cb4
 c04733e0 c0608c8f 00000002 ffffffff 00000000 00000000 c52a5cd8 00000000
Call Trace:
 [<c04733e0>] sync_buffer+0x0/0x33
 [<c0608c8f>] out_of_line_wait_on_bit+0x62/0x6a
 [<f8d8dc85>] just_schedule+0x5/0x8 [gfs2]
 [<c0608c08>] __wait_on_bit+0x33/0x58
 [<f8d8dc80>] just_schedule+0x0/0x8 [gfs2]
 [<f8d8dc80>] just_schedule+0x0/0x8 [gfs2]
 [<c0608c8f>] out_of_line_wait_on_bit+0x62/0x6a
 [<c04362ea>] wake_bit_function+0x0/0x3c
 [<f8d8dc7c>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d8ed29>] glock_wait_internal+0xcc/0x1d0 [gfs2]
 [<f8d8ef98>] gfs2_glock_nq+0x16b/0x18b [gfs2]
 [<f8d90013>] gfs2_glock_nq_atime+0xfa/0x2db [gfs2]
 [<f8d9668b>] gfs2_prepare_write+0xb5/0x32c [gfs2]
 [<c0456975>] generic_file_buffered_write+0x226/0x5a2
 [<c0420b5e>] rebalance_tick+0x11f/0x2e4
 [<c042a3e1>] current_fs_time+0x4a/0x55
 [<c0457197>] __generic_file_aio_write_nolock+0x4a6/0x52a
 [<c04e2704>] __next_cpu+0x12/0x21
 [<c041efa7>] find_busiest_group+0x177/0x462
 [<c04573f5>] generic_file_write+0x0/0x94
 [<c045734b>] __generic_file_write_nolock+0x86/0x9a
 [<c04362bd>] autoremove_wake_function+0x0/0x2d
 [<c0420b5e>] rebalance_tick+0x11f/0x2e4
 [<c0608ce3>] mutex_lock+0xb/0x19
 [<c045742f>] generic_file_write+0x3a/0x94
 [<c04573f5>] generic_file_write+0x0/0x94
 [<c04713ff>] vfs_write+0xa1/0x143
 [<c04719f1>] sys_write+0x3c/0x63
 [<c0404eff>] syscall_call+0x7/0xb
=======================
How reproducible:
every time

Steps to Reproduce:
1. run dd_io with the above kernel; a test case with a large buffer size will trigger it.
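For reference, a minimal sketch of that kind of load (assuming dd_io boils down to concurrent large-block dd writers; the actual harness, block sizes, and file names here are illustrative, and on the cluster the writes target the shared GFS2 mount rather than a local scratch directory):

```shell
# Hypothetical approximation of the dd_io workload: several concurrent
# dd writers using a large block size, each fsync'd at the end.
TARGET=${TARGET:-/tmp}          # on the real setup: a GFS2 mount point
for i in 1 2 3 4; do
    dd if=/dev/zero of="$TARGET/ddio_$i" bs=1M count=8 conv=fsync 2>/dev/null &
done
wait
ls -l "$TARGET"/ddio_*
rm -f "$TARGET"/ddio_*
```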
Does this still happen with the latest build? I presume not, since we've been running dd_io against it extensively and I've had no reports of this, so perhaps we can close this one too?
I haven't seen this in a while, but let's give it the standard six-month NEEDINFO treatment.
Pushing the severity down on the basis that this may well already be fixed.
*** This bug has been marked as a duplicate of 432057 ***