Description of problem:

This is related to the following kernel.org bugzillas:

http://bugzilla.kernel.org/show_bug.cgi?id=7903
http://bugzilla.kernel.org/show_bug.cgi?id=8077

A change was introduced in February to fix hangs in i_size_read caused by calling i_size_write without holding i_mutex. The fix worked out in KBZ#7903 was to use the i_lock spinlock to synchronize cifs' use of i_size_xxxx().

This introduces another deadlock:

cifs_get_inode_info_unix()
        spin_lock(&inode->i_lock);
        if (is_size_safe_to_change(cifsInfo, end_of_file)) {
        [...]
        }
}

is_size_safe_to_change() can end up sleeping via the following sequence of calls:

is_size_safe_to_change+0x24/0x90 [cifs]
find_writable_file+0xe4/0x184 [cifs]
cifs_reopen_file+0x2e4/0x524 [cifs]
CIFSSMBOpen+0x2d8/0x518 [cifs]
SendReceive+0x2dc/0x598 [cifs]
wait_for_response+0xe8/0x1bc [cifs]
schedule+0xa98/0xbf4

If one thread ends up sleeping in this code path while holding i_lock, the system can deadlock as soon as all other CPUs enter CIFS and spin on the same inode's i_lock. This has been seen on 2-CPU power5 systems.

Version-Release number of selected component (if applicable):
2.6.18-32.el5

How reproducible:
Moderate - stress test of several hours duration required.

Steps to Reproduce:
1. Run LTP locktest case (testcases/network/nfsv4/locks/) for several hours:
   http://ltp.cvs.sourceforge.net/ltp/ltp/testcases/network/nfsv4/locks/

Actual results:
After some time the system deadlocks and reports soft lockups pointing at the above CIFS code:

<3>BUG: soft lockup detected on CPU#1!.
<4>Call Trace:.
<4>[C0000000E4D3F290] [C00000000000FFDC] .show_stack+0x68/0x1b0 (unreliable).
<4>[C0000000E4D3F330] [C0000000000A68C4] .softlockup_tick+0xf0/0x13c.
<4>[C0000000E4D3F3E0] [C000000000075BF8] .run_local_timers+0x1c/0x30.
<4>[C0000000E4D3F460] [C0000000000235F0] .timer_interrupt+0xa8/0x498.
<4>[C0000000E4D3F540] [C0000000000034F4] decrementer_common+0xf4/0x100.
<4>--- Exception: 901 at ._spin_lock+0x40/0x88.
<4> LR = .cifs_get_inode_info_unix+0x784/0x9c0 [cifs].
<4>[C0000000E4D3F830] [C0000000EA44EE40] 0xc0000000ea44ee40 (unreliable).
<4>[C0000000E4D3F8B0] [D000000000B4BE78] .cifs_get_inode_info_unix+0x784/0x9c0 [cifs].
<4>[C0000000E4D3FA20] [D000000000B4A908] .cifs_open+0x72c/0x8d0 [cifs].
<4>[C0000000E4D3FB30] [C0000000000E8754] .__dentry_open+0x13c/0x2bc.
<4>[C0000000E4D3FBE0] [C0000000000E8A48] .do_filp_open+0x50/0x70.
<4>[C0000000E4D3FD00] [C0000000000E8ADC] .do_sys_open+0x74/0x130.
<4>[C0000000E4D3FDB0] [C000000000128210] .compat_sys_open+0x24/0x38.
<4>[C0000000E4D3FE30] [C0000000000086A4] syscall_exit+0x0/0x40.
<3>BUG: soft lockup detected on CPU#0!.
<4>Call Trace:.
<4>[C0000000B307B290] [C00000000000FFDC] .show_stack+0x68/0x1b0 (unreliable).
<4>[C0000000B307B330] [C0000000000A68C4] .softlockup_tick+0xf0/0x13c.
<4>[C0000000B307B3E0] [C000000000075BF8] .run_local_timers+0x1c/0x30.
<4>[C0000000B307B460] [C0000000000235F0] .timer_interrupt+0xa8/0x498.
<4>[C0000000B307B540] [C0000000000034F4] decrementer_common+0xf4/0x100.
<4>--- Exception: 901 at ._spin_lock+0x50/0x88.
<4> LR = .cifs_get_inode_info_unix+0x784/0x9c0 [cifs].
<4>[C0000000B307B830] [D000000000B889F8] 0xd000000000b889f8 (unreliable).
<4>[C0000000B307B8B0] [D000000000B4BE78] .cifs_get_inode_info_unix+0x784/0x9c0 [cifs].
<4>[C0000000B307BA20] [D000000000B4A908] .cifs_open+0x72c/0x8d0 [cifs].
<4>[C0000000B307BB30] [C0000000000E8754] .__dentry_open+0x13c/0x2bc.
<4>[C0000000B307BBE0] [C0000000000E8A48] .do_filp_open+0x50/0x70.
<4>[C0000000B307BD00] [C0000000000E8ADC] .do_sys_open+0x74/0x130.
<4>[C0000000B307BDB0] [C000000000128210] .compat_sys_open+0x24/0x38.
<4>[C0000000B307BE30] [C0000000000086A4] syscall_exit+0x0/0x40.

Expected results:
No deadlock, fs test runs to completion.

Additional info:
The original kernel.org bugzilla was triggered via multiple instances of cp/find etc. It is not yet clear whether this problem can also be triggered by these simple tests, but it is reported to be reliably triggered by the LTP testcase.
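For readers who want to see the locking rule in isolation, here is a minimal userspace sketch of the general pattern: do anything that may sleep before taking the spinlock, and keep the critical section to a plain compare and store. This is not the upstream patch and not kernel code. The struct and the demo_* functions are hypothetical names invented for illustration, the size rule is a simplified assumption, and the sketch assumes Linux/glibc POSIX spinlocks (build with -pthread).

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode state involved; not the kernel's struct inode. */
struct demo_inode {
    pthread_spinlock_t i_lock;  /* plays the role of inode->i_lock */
    long long i_size;           /* plays the role of the cached file size */
    bool open_for_write;        /* stand-in for "a writable handle exists" */
};

static void demo_inode_init(struct demo_inode *inode, long long size, bool writable)
{
    int rc = pthread_spin_init(&inode->i_lock, PTHREAD_PROCESS_PRIVATE);
    assert(rc == 0);
    (void)rc;
    inode->i_size = size;
    inode->open_for_write = writable;
}

/*
 * Hypothetical lookup that is allowed to block (in the report, the real
 * lookup could end up waiting for a server response). It must therefore
 * be called with no spinlock held.
 */
static bool demo_has_writable_handle(const struct demo_inode *inode)
{
    return inode->open_for_write;
}

/* Returns true if the cached size was replaced by the server's value. */
static bool demo_update_size(struct demo_inode *inode, long long end_of_file)
{
    /* Step 1: anything that might sleep runs before the spinlock is taken. */
    bool writer = demo_has_writable_handle(inode);

    /* Step 2: the critical section only compares and stores; it cannot sleep. */
    bool changed = false;
    pthread_spin_lock(&inode->i_lock);
    if (!writer || inode->i_size < end_of_file) {
        inode->i_size = end_of_file;
        changed = true;
    }
    pthread_spin_unlock(&inode->i_lock);
    return changed;
}
```

The trade-off is that the answer from step 1 can be stale by the time step 2 runs, so a real fix has to decide whether that window is acceptable or whether the check itself must be made non-blocking.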
Created attachment 159920
avoid sleeping inside is_size_safe_to_change

Amit's 5.1 backport of Steve French's upstream patch
Merged into cifs-2.6-git today: http://git.kernel.org/?p=linux/kernel/git/sfrench/cifs-2.6.git;a=commit;h=a403a0a370946e7dbcda6464a3509089daee54bc
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Created attachment 161345
patch -- Amit's patch ported to 5.1 beta kernels

Same patch as Amit's patch but fixed up to apply to 5.1 beta kernels.
For QA purposes, could you elaborate on exactly how you reproduce this issue? i.e. how are you running locktest here?
Was able to reproduce this pretty quickly by running the ltp locktest like this:

# locktest -n 50 -f /file/on/cifs

...while running "service smb restart" in a loop on the server. I want to reproduce it a couple of times to make sure I can do it reliably and then I'll test whether the patch fixes it.
in 2.6.18-42.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Reproduced the problem with the testcase in comment 11 on the -40 kernel; was unable to reproduce it on the -45 kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html