Bug 1770404
| Summary: | CIFS client system hung with all cpus at 100% inside smb2_reconnect spinning on TCP_Server_Info.response_q.lock spinlock | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dave Wysochanski <dwysocha> |
| Component: | kernel | Assignee: | Dave Wysochanski <dwysocha> |
| kernel sub component: | CIFS | QA Contact: | xiaoli feng <xifeng> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | cifs-maint, fsorenso, lsahlber, tborcin, xzhou |
| Version: | 7.7 | Keywords: | Reproducer |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | kernel-3.10.0-1119.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-03-31 19:35:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Dave Wysochanski
2019-11-08 22:38:55 UTC
I think I can reproduce this now, but I'll have to work on refinements next week and see if I can get it down to a script that is 100% reliable.
I think it requires the 'find' processes to be stuck like this first:
crash> ps -m find
[0 00:00:06.454] [IN] PID: 11276 TASK: ffff9b60b677b150 CPU: 6 COMMAND: "find"
[0 00:00:06.493] [IN] PID: 11282 TASK: ffff9b60b5c88000 CPU: 3 COMMAND: "find"
[0 00:00:06.493] [IN] PID: 11281 TASK: ffff9b61b95be2a0 CPU: 4 COMMAND: "find"
[0 00:00:06.497] [IN] PID: 11275 TASK: ffff9b612fcd62a0 CPU: 2 COMMAND: "find"
[0 00:00:06.509] [IN] PID: 11272 TASK: ffff9b60b6779070 CPU: 3 COMMAND: "find"
[0 00:00:06.509] [IN] PID: 11283 TASK: ffff9b60b5d7d230 CPU: 0 COMMAND: "find"
[0 00:00:06.509] [IN] PID: 11280 TASK: ffff9b612fcd1070 CPU: 5 COMMAND: "find"
[0 00:00:06.509] [IN] PID: 11279 TASK: ffff9b60b6468000 CPU: 2 COMMAND: "find"
[0 00:00:06.525] [IN] PID: 11278 TASK: ffff9b60b5d7c1c0 CPU: 7 COMMAND: "find"
[0 00:00:06.525] [IN] PID: 11274 TASK: ffff9b61b869d230 CPU: 4 COMMAND: "find"
crash> bt 11274
PID: 11274 TASK: ffff9b61b869d230 CPU: 4 COMMAND: "find"
#0 [ffff9b61b6763a18] __schedule at ffffffffb3b848aa
#1 [ffff9b61b6763aa8] schedule at ffffffffb3b84d59
#2 [ffff9b61b6763ab8] schedule_timeout at ffffffffb3b827a8
#3 [ffff9b61b6763b60] smb2_reconnect at ffffffffc0460167 [cifs]
#4 [ffff9b61b6763bf0] smb2_plain_req_init at ffffffffc04604ae [cifs]
#5 [ffff9b61b6763c30] SMB2_open at ffffffffc0461e0f [cifs]
#6 [ffff9b61b6763d50] smb2_query_dir_first at ffffffffc0458607 [cifs]
#7 [ffff9b61b6763dd8] initiate_cifs_search at ffffffffc0450bfd [cifs]
#8 [ffff9b61b6763e38] cifs_readdir at ffffffffc0451b2b [cifs]
#9 [ffff9b61b6763ea8] iterate_dir at ffffffffb3662677
#10 [ffff9b61b6763ee0] sys_getdents at ffffffffb3662b92
#11 [ffff9b61b6763f50] system_call_fastpath at ffffffffb3b91ed2
RIP: 00007ff7a06db275 RSP: 00007ffc085f5b78 RFLAGS: 00000246
RAX: 000000000000004e RBX: 000000000220e4f0 RCX: ffffffffb3b91e15
RDX: 0000000000008000 RSI: 000000000220e4f0 RDI: 0000000000000005
RBP: 000000000220e4f0 R8: 0000000000000001 R9: 0000000000000018
R10: 0000000000000100 R11: 0000000000000246 R12: fffffffffffffe80
R13: 000000000000000b R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: 000000000000004e CS: 0033 SS: 002b
crash> ps | grep cifsd
11224 2 3 ffff9b60b5c8e2a0 UN 0.0 0 0 [cifsd]
crash> bt 11224
PID: 11224 TASK: ffff9b60b5c8e2a0 CPU: 3 COMMAND: "cifsd"
#0 [ffff9b6119e2f9a8] __schedule at ffffffffb3b848aa
#1 [ffff9b6119e2fa38] schedule at ffffffffb3b84d59
#2 [ffff9b6119e2fa48] schedule_timeout at ffffffffb3b827a8
#3 [ffff9b6119e2faf8] sk_wait_data at ffffffffb3a37b69
#4 [ffff9b6119e2fb58] tcp_recvmsg at ffffffffb3aac7c3
#5 [ffff9b6119e2fbf8] inet_recvmsg at ffffffffb3adb5b0
#6 [ffff9b6119e2fc28] sock_recvmsg at ffffffffb3a324f5
#7 [ffff9b6119e2fd98] kernel_recvmsg at ffffffffb3a3256a
#8 [ffff9b6119e2fdb8] cifs_readv_from_socket at ffffffffc0430144 [cifs]
#9 [ffff9b6119e2fe58] cifs_demultiplex_thread at ffffffffc043066c [cifs]
#10 [ffff9b6119e2fec8] kthread at ffffffffb34c6451
Then issue 'killall find' and the machine will go bonkers, with all CPUs spinning at 100%.
Forgot to include the backtrace and code for the spinning task:
PID: 9273 TASK: ffff8bdaf61762a0 CPU: 0 COMMAND: "find"
[exception RIP: __pv_queued_spin_lock_slowpath+156]
RIP: ffffffff9db17a1c RSP: ffff8bdb6eedbac8 RFLAGS: 00000046
RAX: 0000000000000000 RBX: ffff8bdbf723e168 RCX: 0000000000000001
RDX: 0000000000190000 RSI: 0000000000000000 RDI: ffff8bdbf723e168
RBP: ffff8bdb6eedbaf8 R8: ffff8bdb6eedbba8 R9: 000000000000004c
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8bdbffc1b8c0
R13: 0000000000000001 R14: 0000000000010000 R15: ffff8bdaf61762a0
CS: 0010 SS: 0018
#0 [ffff8bdb6eedbb00] queued_spin_lock_slowpath at ffffffff9e178fd0
#1 [ffff8bdb6eedbb10] _raw_spin_lock_irqsave at ffffffff9e187707
#2 [ffff8bdb6eedbb28] prepare_to_wait at ffffffff9dac7007
#3 [ffff8bdb6eedbb60] smb2_reconnect at ffffffffc05df146 [cifs]
crash> dis -lr ffffffffc05df146 | tail --lines=20
0xffffffffc05df0ef <smb2_reconnect+175>: mov %rax,%r15
/usr/src/debug/kernel-3.10.0-1109.el7/linux-3.10.0-1109.el7.x86_64/fs/cifs/smb2pdu.c: 212
0xffffffffc05df0f2 <smb2_reconnect+178>: mov %rcx,-0x78(%rbp)
0xffffffffc05df0f6 <smb2_reconnect+182>: jmpq 0xffffffffc05df18d <smb2_reconnect+333>
0xffffffffc05df0fb <smb2_reconnect+187>: nopl 0x0(%rax,%rax,1)
0xffffffffc05df100 <smb2_reconnect+192>: mov -0x70(%rbp),%rax
0xffffffffc05df104 <smb2_reconnect+196>: lea 0x168(%r12),%r14
0xffffffffc05df10c <smb2_reconnect+204>: movq $0x0,-0x58(%rbp)
0xffffffffc05df114 <smb2_reconnect+212>: movq $0xffffffff9dac7540,-0x48(%rbp)
0xffffffffc05df11c <smb2_reconnect+220>: mov %rax,-0x50(%rbp)
0xffffffffc05df120 <smb2_reconnect+224>: mov -0x78(%rbp),%rax
0xffffffffc05df124 <smb2_reconnect+228>: mov %rax,-0x40(%rbp)
0xffffffffc05df128 <smb2_reconnect+232>: mov %rax,-0x38(%rbp)
0xffffffffc05df12c <smb2_reconnect+236>: mov $0x2710,%eax
0xffffffffc05df131 <smb2_reconnect+241>: lea -0x58(%rbp),%rsi
0xffffffffc05df135 <smb2_reconnect+245>: mov $0x1,%edx
0xffffffffc05df13a <smb2_reconnect+250>: mov %r14,%rdi
0xffffffffc05df13d <smb2_reconnect+253>: mov %rax,-0x60(%rbp)
0xffffffffc05df141 <smb2_reconnect+257>: callq 0xffffffff9dac6fe0 <prepare_to_wait>
0xffffffffc05df146 <smb2_reconnect+262>: cmpl $0x3,0x48(%r12)
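The disassembly places the spin at the prepare_to_wait() call made by the wait_event_interruptible_timeout() at smb2pdu.c:212. prepare_to_wait() takes the wait queue's internal spinlock on every pass, so a task that keeps re-entering the wait loop without ever sleeping hammers TCP_Server_Info.response_q.lock, and several such tasks drive every CPU to 100% contending on that lock. A simplified sketch of the locking (paraphrased from the kernel's wait-queue code, not the exact RHEL7 source):

```c
/* Simplified sketch of prepare_to_wait(); the real function lives in
 * kernel/sched/wait.c and differs in detail. */
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);	/* <-- response_q.lock */
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);
}
```

Every iteration of the reconnect loop re-acquires q->lock here, which matches the __pv_queued_spin_lock_slowpath frames in the backtraces.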
152 static int
153 smb2_reconnect(__le16 smb2_command, struct cifs_tcon *tcon)
154 {
155 int rc = 0;
156 struct nls_table *nls_codepage;
157 struct cifs_ses *ses;
158 struct TCP_Server_Info *server;
159
...
192 /*
193 * Give demultiplex thread up to 10 seconds to reconnect, should be
194 * greater than cifs socket timeout which is 7 seconds
195 */
196 while (server->tcpStatus == CifsNeedReconnect) {
197 /*
198 * Return to caller for TREE_DISCONNECT and LOGOFF and CLOSE
199 * here since they are implicitly done when session drops.
200 */
201 switch (smb2_command) {
202 /*
203 * BB Should we keep oplock break and add flush to exceptions?
204 */
205 case SMB2_TREE_DISCONNECT:
206 case SMB2_CANCEL:
207 case SMB2_CLOSE:
208 case SMB2_OPLOCK_BREAK:
209 return -EAGAIN;
210 }
211
212--> wait_event_interruptible_timeout(server->response_q,
213 (server->tcpStatus != CifsNeedReconnect), 10 * HZ);
214
215 /* are we still trying to reconnect? */
216 if (server->tcpStatus != CifsNeedReconnect)
217 break;
218
219 /*
220 * on "soft" mounts we wait once. Hard mounts keep
221 * retrying until process is killed or server comes
222 * back on-line
223 */
224 if (!tcon->retry) {
225 cifs_dbg(FYI, "gave up waiting on reconnect in smb_init\n");
226 return -EHOSTDOWN;
227 }
228 }
This is what we appear to need here:
commit 7ffbe65578b44fafdef577a360eb0583929f7c6e
Author: Paulo Alcantara <paulo>
Date: Thu Jul 5 13:46:34 2018 -0300
cifs: Fix infinite loop when using hard mount option
For every request we send, whether it is SMB1 or SMB2+, we attempt to
reconnect tcon (cifs_reconnect_tcon or smb2_reconnect) before carrying
out the request.
So, while server->tcpStatus != CifsNeedReconnect, we wait for the
reconnection to succeed on wait_event_interruptible_timeout(). If it
returns, that means that either the condition was evaluated to true, or
timeout elapsed, or it was interrupted by a signal.
Since we're not handling the case where the process woke up due to a
received signal (-ERESTARTSYS), the next call to
wait_event_interruptible_timeout() will _always_ fail and we end up
looping forever inside either cifs_reconnect_tcon() or smb2_reconnect().
Here's an example of how to trigger that:
$ mount.cifs //foo/share /mnt/test -o username=foo,password=foo,vers=1.0,hard
(break connection to server before executing below cmd)
$ stat -f /mnt/test & sleep 140
[1] 2511
$ ps -aux -q 2511
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2511 0.0 0.0 12892 1008 pts/0 S 12:24 0:00 stat -f /mnt/test
$ kill -9 2511
(wait for a while; process is stuck in the kernel)
$ ps -aux -q 2511
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2511 83.2 0.0 12892 1008 pts/0 R 12:24 30:01 stat -f /mnt/test
Using the 'hard' mount option means that cifs.ko will keep retrying
indefinitely; however, we must allow the process to be killed, otherwise
it would hang the system.
Signed-off-by: Paulo Alcantara <palcantara>
Cc: stable.org
Reviewed-by: Aurelien Aptel <aaptel>
Signed-off-by: Steve French <stfrench>
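Applied to the RHEL7 code above, the fix amounts to checking the return value of the wait at smb2pdu.c:212 and bailing out when a signal woke the process (paraphrased from the upstream commit; the exact hunk may differ slightly):

```c
	rc = wait_event_interruptible_timeout(server->response_q,
			(server->tcpStatus != CifsNeedReconnect),
			10 * HZ);
	if (rc < 0) {
		/* woken by a pending signal: let the caller (and the
		 * process) die instead of looping forever */
		return -ERESTARTSYS;
	}
```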
The above commit backports cleanly on top of 1109.el7, there are no upstream commits referencing it (i.e. no follow-up regression fixes), and it is CC'd to stable, so I'm going to backport it.
Created attachment 1634441 [details]
test case for this bug; reproduces the panic in around 60 seconds with unpatched 3.10.0-1109.el7
FWIW, this bug goes back to RHEL7.0, so it's not a regression. It only reproduces with a non-default mount option ('hard'), but it still seems like a good candidate for RHEL7. The fix was already in RHEL8.0, so the problem is unique to RHEL7.
Patch(es) committed on kernel-3.10.0-1119.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:1016