Description of problem:

I happened upon this problem while using the reproducer for
https://bugzilla.redhat.com/show_bug.cgi?id=1614201#c21 and then doing a
"killall find" while I was installing a new kernel. Basically, the test runs
'find' on an SMB share while the server restarts the smb service in a loop.
Somewhere along the way all CPUs go to 100%, with processes spinning inside
the reconnect loop in smb2_reconnect. Because all CPUs are stuck there, I
don't think the system can ever complete the reconnect, so it is fully
stuck - the run queues are huge, and 'find' processes dominate the CPUs,
all stuck in the same place.

crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-1109.el7.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/403593259/crash/vmcore
        CPUS: 4
        DATE: Wed Nov  6 11:26:34 2019
      UPTIME: 03:35:24
LOAD AVERAGE: 51.17, 26.51, 13.44
       TASKS: 174
    NODENAME: rhel7u7-node1.dwysocha.net
     RELEASE: 3.10.0-1109.el7.x86_64
     VERSION: #1 SMP Sat Nov 2 11:57:33 EDT 2019
     MACHINE: x86_64  (1795 Mhz)
      MEMORY: 4 GB
       PANIC: ""

crash> ps | grep \>
>  9191      1   3  ffff8bdbf8135230  RU   0.0  120420   1584  find
>  9234      1   2  ffff8bdbf9551070  RU   0.0  120512   1624  find
>  9235      1   1  ffff8bdb747bd230  RU   0.0  120512   1584  find
>  9273      1   0  ffff8bdaf61762a0  RU   0.0  120344   1496  find

crash> foreach 9191 9234 9235 9273 bt | grep -A 2 -B 10 raw_spin
PID: 9191   TASK: ffff8bdbf8135230  CPU: 3   COMMAND: "find"
    [exception RIP: __pv_queued_spin_lock_slowpath+242]
    RIP: ffffffff9db17a72  RSP: ffff8bdb6ee2fad0  RFLAGS: 00000002
    RAX: 0000000000000000  RBX: ffff8bdbf723e168  RCX: 0000000000007b6c
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffff8bdbffc1b8c0
    RBP: ffff8bdb6ee2fb00   R8: ffff8bdb6eedbba8   R9: 000000000000004c
    R10: 0000000000000000  R11: 0000000000000001  R12: ffff8bdbffd9b8c0
    R13: 0000000000000001  R14: 0000000000190000  R15: 0000000000000000
    CS: 0010  SS: 0018
 #0 [ffff8bdb6ee2fb08] queued_spin_lock_slowpath at ffffffff9e178fd0
 #1 [ffff8bdb6ee2fb18] _raw_spin_lock_irqsave at ffffffff9e187707
 #2 [ffff8bdb6ee2fb30] finish_wait at ffffffff9dac7178
 #3 [ffff8bdb6ee2fb60] smb2_reconnect at ffffffffc05df178 [cifs]
--
PID: 9234   TASK: ffff8bdbf9551070  CPU: 2   COMMAND: "find"
    [exception RIP: __pv_queued_spin_lock_slowpath+572]
    RIP: ffffffff9db17bbc  RSP: ffff8bdb714176f0  RFLAGS: 00000002
    RAX: 0000000000007d29  RBX: ffff8bdbffd9b8c0  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff8bdbffd9b8c0
    RBP: ffff8bdb71417720   R8: ffff8bdb6ee2fba8   R9: 000000000000004c
    R10: 0000000000000000  R11: 0000000000000001  R12: ffff8bdbffd1b8c0
    R13: ffff8bdbffd1b904  R14: 0000000000110000  R15: 0000000000000001
    CS: 0010  SS: 0018
 #0 [ffff8bdb71417728] queued_spin_lock_slowpath at ffffffff9e178fd0
 #1 [ffff8bdb71417738] _raw_spin_lock_irqsave at ffffffff9e187707
 #2 [ffff8bdb71417750] finish_wait at ffffffff9dac7178
 #3 [ffff8bdb71417780] smb2_reconnect at ffffffffc05df178 [cifs]
--
PID: 9235   TASK: ffff8bdb747bd230  CPU: 1   COMMAND: "find"
    [exception RIP: __pv_queued_spin_lock_slowpath+572]
    RIP: ffffffff9db17bbc  RSP: ffff8bdb6ec83708  RFLAGS: 00000002
    RAX: 0000000000007f16  RBX: ffff8bdbffd1b8c0  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff8bdbffd1b8c0
    RBP: ffff8bdb6ec83738   R8: ffff8bdb6ec837e8   R9: 000000000000004c
    R10: 0000000000000000  R11: 0000000000000001  R12: ffff8bdbffc9b8c0
    R13: ffff8bdbffc9b904  R14: 0000000000090000  R15: 0000000000000001
    CS: 0010  SS: 0000
 #0 [ffff8bdb6ec83740] queued_spin_lock_slowpath at ffffffff9e178fd0
 #1 [ffff8bdb6ec83750] _raw_spin_lock_irqsave at ffffffff9e187707
 #2 [ffff8bdb6ec83768] prepare_to_wait at ffffffff9dac7007
 #3 [ffff8bdb6ec837a0] smb2_reconnect at ffffffffc05df146 [cifs]
--
PID: 9273   TASK: ffff8bdaf61762a0  CPU: 0   COMMAND: "find"
    [exception RIP: __pv_queued_spin_lock_slowpath+156]
    RIP: ffffffff9db17a1c  RSP: ffff8bdb6eedbac8  RFLAGS: 00000046
    RAX: 0000000000000000  RBX: ffff8bdbf723e168  RCX: 0000000000000001
    RDX: 0000000000190000  RSI: 0000000000000000  RDI: ffff8bdbf723e168
    RBP: ffff8bdb6eedbaf8   R8: ffff8bdb6eedbba8   R9: 000000000000004c
    R10: 0000000000000000  R11: 0000000000000001  R12: ffff8bdbffc1b8c0
    R13: 0000000000000001  R14: 0000000000010000  R15: ffff8bdaf61762a0
    CS: 0010  SS: 0018
 #0 [ffff8bdb6eedbb00] queued_spin_lock_slowpath at ffffffff9e178fd0
 #1 [ffff8bdb6eedbb10] _raw_spin_lock_irqsave at ffffffff9e187707
 #2 [ffff8bdb6eedbb28] prepare_to_wait at ffffffff9dac7007
 #3 [ffff8bdb6eedbb60] smb2_reconnect at ffffffffc05df146 [cifs]

crash> dis -lr ffffffffc05df146 | tail --lines=20
0xffffffffc05df0ef <smb2_reconnect+175>:	mov    %rax,%r15
/usr/src/debug/kernel-3.10.0-1109.el7/linux-3.10.0-1109.el7.x86_64/fs/cifs/smb2pdu.c: 212
0xffffffffc05df0f2 <smb2_reconnect+178>:	mov    %rcx,-0x78(%rbp)
0xffffffffc05df0f6 <smb2_reconnect+182>:	jmpq   0xffffffffc05df18d <smb2_reconnect+333>
0xffffffffc05df0fb <smb2_reconnect+187>:	nopl   0x0(%rax,%rax,1)
0xffffffffc05df100 <smb2_reconnect+192>:	mov    -0x70(%rbp),%rax
0xffffffffc05df104 <smb2_reconnect+196>:	lea    0x168(%r12),%r14
0xffffffffc05df10c <smb2_reconnect+204>:	movq   $0x0,-0x58(%rbp)
0xffffffffc05df114 <smb2_reconnect+212>:	movq   $0xffffffff9dac7540,-0x48(%rbp)
0xffffffffc05df11c <smb2_reconnect+220>:	mov    %rax,-0x50(%rbp)
0xffffffffc05df120 <smb2_reconnect+224>:	mov    -0x78(%rbp),%rax
0xffffffffc05df124 <smb2_reconnect+228>:	mov    %rax,-0x40(%rbp)
0xffffffffc05df128 <smb2_reconnect+232>:	mov    %rax,-0x38(%rbp)
0xffffffffc05df12c <smb2_reconnect+236>:	mov    $0x2710,%eax
0xffffffffc05df131 <smb2_reconnect+241>:	lea    -0x58(%rbp),%rsi
0xffffffffc05df135 <smb2_reconnect+245>:	mov    $0x1,%edx
0xffffffffc05df13a <smb2_reconnect+250>:	mov    %r14,%rdi
0xffffffffc05df13d <smb2_reconnect+253>:	mov    %rax,-0x60(%rbp)
0xffffffffc05df141 <smb2_reconnect+257>:	callq  0xffffffff9dac6fe0 <prepare_to_wait>
0xffffffffc05df146 <smb2_reconnect+262>:	cmpl   $0x3,0x48(%r12)

crash> runq -c 0
CPU 0 RUNQUEUE: ffff8bdbffc1acc0
  CURRENT: PID: 9273   TASK: ffff8bdaf61762a0  COMMAND: "find"
  RT PRIO_ARRAY: ffff8bdbffc1ae60
     [  0] PID: 11     TASK: ffff8bdbf94a3150  COMMAND: "watchdog/0"
  CFS RB_ROOT: ffff8bdbffc1ad68
     [120] PID: 9604   TASK: ffff8bdb71559070  COMMAND: "kworker/0:0"
     [120] PID: 851    TASK: ffff8bdbf8188000  COMMAND: "gmain"
     [120] PID: 829    TASK: ffff8bdb798fc1c0  COMMAND: "sssd_pam"
     [120] PID: 818    TASK: ffff8bdb7996c1c0  COMMAND: "gssproxy"
     [120] PID: 827    TASK: ffff8bdb798f9070  COMMAND: "sssd_nss"
     [116] PID: 772    TASK: ffff8bdbf5efa0e0  COMMAND: "auditd"
     [120] PID: 556    TASK: ffff8bdbf5b341c0  COMMAND: "systemd-journal"
     [120] PID: 6      TASK: ffff8bdbf9455230  COMMAND: "ksoftirqd/0"
     [120] PID: 843    TASK: ffff8bdb7996b150  COMMAND: "NetworkManager"
     [120] PID: 21075  TASK: ffff8bdbf7fab150  COMMAND: "killall"
     [120] PID: 21048  TASK: ffff8bdbf7fad230  COMMAND: "gzip"

crash> runq -c 1
CPU 1 RUNQUEUE: ffff8bdbffc9acc0
  CURRENT: PID: 9235   TASK: ffff8bdb747bd230  COMMAND: "find"
  RT PRIO_ARRAY: ffff8bdbffc9ae60
     [  0] PID: 12     TASK: ffff8bdbf9500000  COMMAND: "watchdog/1"
  CFS RB_ROOT: ffff8bdbffc9ad68
     [120] PID: 796    TASK: ffff8bdaf6531070  COMMAND: "sssd"
     [120] PID: 5      TASK: ffff8bdbf94541c0  COMMAND: "kworker/u8:0"
     [120] PID: 806    TASK: ffff8bdbf596c1c0  COMMAND: "irqbalance"
     [120] PID: 824    TASK: ffff8bdb798f8000  COMMAND: "sssd_be"
     [120] PID: 830    TASK: ffff8bdb798fd230  COMMAND: "sssd_ssh"
     [120] PID: 9436   TASK: ffff8bdbf725c1c0  COMMAND: "find"
     [120] PID: 1507   TASK: ffff8bdbf8133150  COMMAND: "master"
     [120] PID: 835    TASK: ffff8bdb79baa0e0  COMMAND: "systemd-logind"
     [120] PID: 7910   TASK: ffff8bdbf63920e0  COMMAND: "pickup"
     [120] PID: 1      TASK: ffff8bdbf9450000  COMMAND: "systemd"
     [120] PID: 472    TASK: ffff8bdaf61741c0  COMMAND: "xfsaild/dm-0"
     [120] PID: 735    TASK: ffff8bdb79bab150  COMMAND: "xfsaild/vda1"
     [120] PID: 6912   TASK: ffff8bdb7155a0e0  COMMAND: "cifsd"

crash> runq -c 2
CPU 2 RUNQUEUE: ffff8bdbffd1acc0
  CURRENT: PID: 9234   TASK: ffff8bdbf9551070  COMMAND: "find"
  RT PRIO_ARRAY: ffff8bdbffd1ae60
     [  0] PID: 17     TASK: ffff8bdbf9505230  COMMAND: "watchdog/2"
  CFS RB_ROOT: ffff8bdbffd1ad68
     [120] PID: 19     TASK: ffff8bdbf9550000  COMMAND: "ksoftirqd/2"
     [120] PID: 9527   TASK: ffff8bdb71558000  COMMAND: "kworker/2:2"
     [100] PID: 775    TASK: ffff8bdb79ba8000  COMMAND: "kworker/2:1H"

crash> runq -c 3
CPU 3 RUNQUEUE: ffff8bdbffd9acc0
  CURRENT: PID: 9191   TASK: ffff8bdbf8135230  COMMAND: "find"
  RT PRIO_ARRAY: ffff8bdbffd9ae60
     [  0] PID: 22     TASK: ffff8bdbf9553150  COMMAND: "watchdog/3"
  CFS RB_ROOT: ffff8bdbffd9ad68
     [120] PID: 9      TASK: ffff8bdbf94a1070  COMMAND: "rcu_sched"
     [120] PID: 1213   TASK: ffff8bdbf7258000  COMMAND: "in:imjournal"
     [120] PID: 32098  TASK: ffff8bdbf725a0e0  COMMAND: "kworker/3:0"
     [120] PID: 9246   TASK: ffff8bdbf8a4c1c0  COMMAND: "find"
     [139] PID: 47     TASK: ffff8bdbf8d562a0  COMMAND: "khugepaged"
     [120] PID: 817    TASK: ffff8bdbf5efb150  COMMAND: "rpcbind"
     [120] PID: 828    TASK: ffff8bdb798fb150  COMMAND: "sssd_sudo"
     [120] PID: 831    TASK: ffff8bdb798fe2a0  COMMAND: "sssd_pac"
     [120] PID: 17642  TASK: ffff8bdbf818e2a0  COMMAND: "find"
     [120] PID: 9410   TASK: ffff8bdbf81362a0  COMMAND: "find"
     [120] PID: 9467   TASK: ffff8bdbf8a4d230  COMMAND: "sshd"
     [120] PID: 837    TASK: ffff8bdb79bad230  COMMAND: "crond"
     [120] PID: 30     TASK: ffff8bdbf957c1c0  COMMAND: "khungtaskd"
     [120] PID: 1525   TASK: ffff8bdaf6440000  COMMAND: "qmgr"
     [120] PID: 24     TASK: ffff8bdbf9555230  COMMAND: "ksoftirqd/3"
     [120] PID: 9339   TASK: ffff8bdb747bb150  COMMAND: "kworker/3:1"
     [120] PID: 9218   TASK: ffff8bdbf818c1c0  COMMAND: "find"
     [120] PID: 2203   TASK: ffff8bdbf5ce5230  COMMAND: "sshd"

Version-Release number of selected component (if applicable):
3.10.0-1109.el7

How reproducible:
I think I have reproduced this twice, but I cannot reproduce it at will.

Steps to Reproduce:
1. Run the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1614201#c21
2. While the reproducer is running (with the server restarting the smb service), run "killall find" and "killall bash" on the cifs client where the script is running

Actual results:
System hang with all CPUs at 100%

Expected results:
No system hang

Additional info:
I have a vmcore and can post its location. Due to the lack of reproducibility, even though this is high severity I would view it as low priority right now (unless we can show it is easier to reproduce and/or likely to be hit by customers).
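As an aside on the backtraces above: the reason every spinning task piles up in __pv_queued_spin_lock_slowpath is that both prepare_to_wait() and finish_wait() serialize on the same waitqueue spinlock (server->response_q's q->lock). A simplified sketch of prepare_to_wait(), loosely based on kernel/wait.c (the sketch_ name is mine, not the RHEL source):

	#include <linux/wait.h>
	#include <linux/sched.h>

	/*
	 * Simplified sketch of prepare_to_wait(): every pass through the
	 * reconnect loop takes q->lock here and takes it again in
	 * finish_wait().  Four CPUs doing this back-to-back with no sleep
	 * in between matches the _raw_spin_lock_irqsave contention in the
	 * backtraces above.
	 */
	void sketch_prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
				    int state)
	{
		unsigned long flags;

		wait->flags &= ~WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&q->lock, flags);	/* the contended lock */
		if (list_empty(&wait->task_list))
			__add_wait_queue(q, wait);
		set_current_state(state);
		spin_unlock_irqrestore(&q->lock, flags);
	}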
I actually think I can reproduce this now, but I'll have to work on refinements next week and see if I can get it down to a script that is 100% reliable. I think it requires the 'find' processes to be stuck like this first:

crash> ps -m find
[0 00:00:06.454] [IN]  PID: 11276  TASK: ffff9b60b677b150  CPU: 6  COMMAND: "find"
[0 00:00:06.493] [IN]  PID: 11282  TASK: ffff9b60b5c88000  CPU: 3  COMMAND: "find"
[0 00:00:06.493] [IN]  PID: 11281  TASK: ffff9b61b95be2a0  CPU: 4  COMMAND: "find"
[0 00:00:06.497] [IN]  PID: 11275  TASK: ffff9b612fcd62a0  CPU: 2  COMMAND: "find"
[0 00:00:06.509] [IN]  PID: 11272  TASK: ffff9b60b6779070  CPU: 3  COMMAND: "find"
[0 00:00:06.509] [IN]  PID: 11283  TASK: ffff9b60b5d7d230  CPU: 0  COMMAND: "find"
[0 00:00:06.509] [IN]  PID: 11280  TASK: ffff9b612fcd1070  CPU: 5  COMMAND: "find"
[0 00:00:06.509] [IN]  PID: 11279  TASK: ffff9b60b6468000  CPU: 2  COMMAND: "find"
[0 00:00:06.525] [IN]  PID: 11278  TASK: ffff9b60b5d7c1c0  CPU: 7  COMMAND: "find"
[0 00:00:06.525] [IN]  PID: 11274  TASK: ffff9b61b869d230  CPU: 4  COMMAND: "find"

crash> bt 11274
PID: 11274  TASK: ffff9b61b869d230  CPU: 4   COMMAND: "find"
 #0 [ffff9b61b6763a18] __schedule at ffffffffb3b848aa
 #1 [ffff9b61b6763aa8] schedule at ffffffffb3b84d59
 #2 [ffff9b61b6763ab8] schedule_timeout at ffffffffb3b827a8
 #3 [ffff9b61b6763b60] smb2_reconnect at ffffffffc0460167 [cifs]
 #4 [ffff9b61b6763bf0] smb2_plain_req_init at ffffffffc04604ae [cifs]
 #5 [ffff9b61b6763c30] SMB2_open at ffffffffc0461e0f [cifs]
 #6 [ffff9b61b6763d50] smb2_query_dir_first at ffffffffc0458607 [cifs]
 #7 [ffff9b61b6763dd8] initiate_cifs_search at ffffffffc0450bfd [cifs]
 #8 [ffff9b61b6763e38] cifs_readdir at ffffffffc0451b2b [cifs]
 #9 [ffff9b61b6763ea8] iterate_dir at ffffffffb3662677
#10 [ffff9b61b6763ee0] sys_getdents at ffffffffb3662b92
#11 [ffff9b61b6763f50] system_call_fastpath at ffffffffb3b91ed2
    RIP: 00007ff7a06db275  RSP: 00007ffc085f5b78  RFLAGS: 00000246
    RAX: 000000000000004e  RBX: 000000000220e4f0  RCX: ffffffffb3b91e15
    RDX: 0000000000008000  RSI: 000000000220e4f0  RDI: 0000000000000005
    RBP: 000000000220e4f0   R8: 0000000000000001   R9: 0000000000000018
    R10: 0000000000000100  R11: 0000000000000246  R12: fffffffffffffe80
    R13: 000000000000000b  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: 000000000000004e  CS: 0033  SS: 002b

crash> ps | grep cifsd
  11224      2   3  ffff9b60b5c8e2a0  UN   0.0       0      0  [cifsd]

crash> bt 11224
PID: 11224  TASK: ffff9b60b5c8e2a0  CPU: 3   COMMAND: "cifsd"
 #0 [ffff9b6119e2f9a8] __schedule at ffffffffb3b848aa
 #1 [ffff9b6119e2fa38] schedule at ffffffffb3b84d59
 #2 [ffff9b6119e2fa48] schedule_timeout at ffffffffb3b827a8
 #3 [ffff9b6119e2faf8] sk_wait_data at ffffffffb3a37b69
 #4 [ffff9b6119e2fb58] tcp_recvmsg at ffffffffb3aac7c3
 #5 [ffff9b6119e2fbf8] inet_recvmsg at ffffffffb3adb5b0
 #6 [ffff9b6119e2fc28] sock_recvmsg at ffffffffb3a324f5
 #7 [ffff9b6119e2fd98] kernel_recvmsg at ffffffffb3a3256a
 #8 [ffff9b6119e2fdb8] cifs_readv_from_socket at ffffffffc0430144 [cifs]
 #9 [ffff9b6119e2fe58] cifs_demultiplex_thread at ffffffffc043066c [cifs]
#10 [ffff9b6119e2fec8] kthread at ffffffffb34c6451

Then issue the 'killall find' and the machine will go bonkers.
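My working theory of why 'killall find' flips these sleeping tasks into the spin (an inference from the backtraces, not something I have traced through the source): the signal can never be delivered because smb2_reconnect() never returns to user space, so signal_pending() stays true forever and every later interruptible wait aborts immediately. Open-coding roughly one pass of the wait:

	/*
	 * Rough open-coding of what wait_event_interruptible_timeout()
	 * does per pass of the reconnect loop (sketch only; names match
	 * the RHEL source).  Before "killall find" the tasks sleep in
	 * schedule_timeout() as in the backtrace above; afterwards the
	 * signal_pending() branch is taken on every pass, with no sleep
	 * at all.
	 */
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&server->response_q, &wait,
				TASK_INTERRUPTIBLE);
		if (server->tcpStatus != CifsNeedReconnect)
			break;			/* reconnected */
		if (signal_pending(current))
			break;			/* -ERESTARTSYS, instantly */
		schedule_timeout(10 * HZ);	/* never reached again */
	}
	finish_wait(&server->response_q, &wait);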
Forgot to put the code in there.

PID: 9273   TASK: ffff8bdaf61762a0  CPU: 0   COMMAND: "find"
    [exception RIP: __pv_queued_spin_lock_slowpath+156]
    RIP: ffffffff9db17a1c  RSP: ffff8bdb6eedbac8  RFLAGS: 00000046
    RAX: 0000000000000000  RBX: ffff8bdbf723e168  RCX: 0000000000000001
    RDX: 0000000000190000  RSI: 0000000000000000  RDI: ffff8bdbf723e168
    RBP: ffff8bdb6eedbaf8   R8: ffff8bdb6eedbba8   R9: 000000000000004c
    R10: 0000000000000000  R11: 0000000000000001  R12: ffff8bdbffc1b8c0
    R13: 0000000000000001  R14: 0000000000010000  R15: ffff8bdaf61762a0
    CS: 0010  SS: 0018
 #0 [ffff8bdb6eedbb00] queued_spin_lock_slowpath at ffffffff9e178fd0
 #1 [ffff8bdb6eedbb10] _raw_spin_lock_irqsave at ffffffff9e187707
 #2 [ffff8bdb6eedbb28] prepare_to_wait at ffffffff9dac7007
 #3 [ffff8bdb6eedbb60] smb2_reconnect at ffffffffc05df146 [cifs]

crash> dis -lr ffffffffc05df146 | tail --lines=20
0xffffffffc05df0ef <smb2_reconnect+175>:	mov    %rax,%r15
/usr/src/debug/kernel-3.10.0-1109.el7/linux-3.10.0-1109.el7.x86_64/fs/cifs/smb2pdu.c: 212
0xffffffffc05df0f2 <smb2_reconnect+178>:	mov    %rcx,-0x78(%rbp)
0xffffffffc05df0f6 <smb2_reconnect+182>:	jmpq   0xffffffffc05df18d <smb2_reconnect+333>
0xffffffffc05df0fb <smb2_reconnect+187>:	nopl   0x0(%rax,%rax,1)
0xffffffffc05df100 <smb2_reconnect+192>:	mov    -0x70(%rbp),%rax
0xffffffffc05df104 <smb2_reconnect+196>:	lea    0x168(%r12),%r14
0xffffffffc05df10c <smb2_reconnect+204>:	movq   $0x0,-0x58(%rbp)
0xffffffffc05df114 <smb2_reconnect+212>:	movq   $0xffffffff9dac7540,-0x48(%rbp)
0xffffffffc05df11c <smb2_reconnect+220>:	mov    %rax,-0x50(%rbp)
0xffffffffc05df120 <smb2_reconnect+224>:	mov    -0x78(%rbp),%rax
0xffffffffc05df124 <smb2_reconnect+228>:	mov    %rax,-0x40(%rbp)
0xffffffffc05df128 <smb2_reconnect+232>:	mov    %rax,-0x38(%rbp)
0xffffffffc05df12c <smb2_reconnect+236>:	mov    $0x2710,%eax
0xffffffffc05df131 <smb2_reconnect+241>:	lea    -0x58(%rbp),%rsi
0xffffffffc05df135 <smb2_reconnect+245>:	mov    $0x1,%edx
0xffffffffc05df13a <smb2_reconnect+250>:	mov    %r14,%rdi
0xffffffffc05df13d <smb2_reconnect+253>:	mov    %rax,-0x60(%rbp)
0xffffffffc05df141 <smb2_reconnect+257>:	callq  0xffffffff9dac6fe0 <prepare_to_wait>
0xffffffffc05df146 <smb2_reconnect+262>:	cmpl   $0x3,0x48(%r12)

fs/cifs/smb2pdu.c:

152 static int
153 smb2_reconnect(__le16 smb2_command, struct cifs_tcon *tcon)
154 {
155 	int rc = 0;
156 	struct nls_table *nls_codepage;
157 	struct cifs_ses *ses;
158 	struct TCP_Server_Info *server;
159 ...
192 	/*
193 	 * Give demultiplex thread up to 10 seconds to reconnect, should be
194 	 * greater than cifs socket timeout which is 7 seconds
195 	 */
196 	while (server->tcpStatus == CifsNeedReconnect) {
197 		/*
198 		 * Return to caller for TREE_DISCONNECT and LOGOFF and CLOSE
199 		 * here since they are implicitly done when session drops.
200 		 */
201 		switch (smb2_command) {
202 		/*
203 		 * BB Should we keep oplock break and add flush to exceptions?
204 		 */
205 		case SMB2_TREE_DISCONNECT:
206 		case SMB2_CANCEL:
207 		case SMB2_CLOSE:
208 		case SMB2_OPLOCK_BREAK:
209 			return -EAGAIN;
210 		}
211
212-->		wait_event_interruptible_timeout(server->response_q,
213 				(server->tcpStatus != CifsNeedReconnect), 10 * HZ);
214
215 		/* are we still trying to reconnect? */
216 		if (server->tcpStatus != CifsNeedReconnect)
217 			break;
218
219 		/*
220 		 * on "soft" mounts we wait once. Hard mounts keep
221 		 * retrying until process is killed or server comes
222 		 * back on-line
223 		 */
224 		if (!tcon->retry) {
225 			cifs_dbg(FYI, "gave up waiting on reconnect in smb_init\n");
226 			return -EHOSTDOWN;
227 		}
228 	}
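The detail that matters in the listing above is that the return value of wait_event_interruptible_timeout() at line 212 is discarded. Its return contract (standard kernel semantics; the rc variable below is only for illustration) is:

	long rc;

	rc = wait_event_interruptible_timeout(server->response_q,
			(server->tcpStatus != CifsNeedReconnect), 10 * HZ);
	/*
	 * rc > 0  : condition became true; remaining jiffies are returned
	 * rc == 0 : timed out and the condition is still false
	 * rc < 0  : -ERESTARTSYS, a signal interrupted the wait -- and it
	 *           keeps returning immediately on every retry for as long
	 *           as the signal stays pending
	 *
	 * Since the call at line 212 ignores rc, the rc < 0 case turns the
	 * while loop into a busy spin: tcpStatus is still CifsNeedReconnect,
	 * and on a 'hard' mount tcon->retry is set, so the -EHOSTDOWN exit
	 * at line 226 is never taken either.
	 */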
This is what we appear to need here:

commit 7ffbe65578b44fafdef577a360eb0583929f7c6e
Author: Paulo Alcantara <paulo>
Date:   Thu Jul 5 13:46:34 2018 -0300

    cifs: Fix infinite loop when using hard mount option

    For every request we send, whether it is SMB1 or SMB2+, we attempt to
    reconnect tcon (cifs_reconnect_tcon or smb2_reconnect) before carrying
    out the request.

    So, while server->tcpStatus != CifsNeedReconnect, we wait for the
    reconnection to succeed on wait_event_interruptible_timeout(). If it
    returns, that means that either the condition was evaluated to true,
    or the timeout elapsed, or it was interrupted by a signal.

    Since we're not handling the case where the process woke up due to a
    received signal (-ERESTARTSYS), the next call to
    wait_event_interruptible_timeout() will _always_ fail and we end up
    looping forever inside either cifs_reconnect_tcon() or
    smb2_reconnect().

    Here's an example of how to trigger that:

    $ mount.cifs //foo/share /mnt/test -o username=foo,password=foo,vers=1.0,hard

    (break connection to server before executing below cmd)
    $ stat -f /mnt/test & sleep 140
    [1] 2511

    $ ps -aux -q 2511
    USER  PID %CPU %MEM   VSZ  RSS TTY   STAT START TIME COMMAND
    root 2511  0.0  0.0 12892 1008 pts/0 S    12:24 0:00 stat -f /mnt/test

    $ kill -9 2511

    (wait for a while; process is stuck in the kernel)
    $ ps -aux -q 2511
    USER  PID %CPU %MEM   VSZ  RSS TTY   STAT START TIME  COMMAND
    root 2511 83.2  0.0 12892 1008 pts/0 R    12:24 30:01 stat -f /mnt/test

    Using the 'hard' mount option means that cifs.ko will keep retrying
    indefinitely; however, we must allow the process to be killed,
    otherwise it would hang the system.

    Signed-off-by: Paulo Alcantara <palcantara>
    Cc: stable.org
    Reviewed-by: Aurelien Aptel <aaptel>
    Signed-off-by: Steve French <stfrench>
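For reference, the smb2_reconnect() hunk of that commit is essentially the following (reproduced from memory of the upstream patch, so treat it as a sketch and check the commit itself for the exact text); the SMB1 path in cifs_reconnect_tcon() gets the same check:

		rc = wait_event_interruptible_timeout(server->response_q,
				(server->tcpStatus != CifsNeedReconnect),
				10 * HZ);
		if (rc < 0) {
			/* a pending signal woke us; stop retrying and let
			 * the caller restart or die */
			cifs_dbg(FYI, "%s: aborting reconnect due to a "
				 "received signal by the process\n", __func__);
			return -ERESTARTSYS;
		}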
The above commit backports cleanly on top of 1109.el7, there are no upstream commits referencing it (i.e., no follow-up fixes suggesting it caused a regression), and it is CC'd to stable, so I'm going to backport it.
Created attachment 1634441 [details]
test case for this bug that reproduces panic in around 60 seconds with unpatched 3.10.0-1109.el7
FWIW, this bug goes back to RHEL7.0, so it is not a regression. It only reproduces with a non-default mount option ('hard'), but it still seems like a good candidate for RHEL7. The patch was already in RHEL8.0, so the problem is unique to RHEL7.
Patch(es) committed on kernel-3.10.0-1119.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1016