Bug 1737511
| Summary: | RHEL 7.6: smb2.1 client repeated cifsd kernel panic due to stale reference to old socket | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Frank Sorenson <fsorenso> |
| Component: | kernel | Assignee: | cifs-maint |
| kernel sub component: | CIFS | QA Contact: | Filesystem QE <fs-qe> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | akaiser, dwysocha, lsahlber, rdieter, xzhou |
| Version: | 7.6 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-05 20:00:13 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Frank Sorenson
2019-08-05 14:33:44 UTC
vmcore 833419566
PID: 7698 TASK: ffff9d0575a9a080 CPU: 5 COMMAND: "cifsd"
[exception RIP: sk_wait_data+0xc5]
#10 [ffff9d0568603b68] tcp_recvmsg at ffffffff94691bd3
#11 [ffff9d0568603c08] inet_recvmsg at ffffffff946c00e0
#12 [ffff9d0568603c38] sock_recvmsg at ffffffff946194f5
#13 [ffff9d0568603da0] kernel_recvmsg at ffffffff9461956a
#14 [ffff9d0568603dc0] cifs_readv_from_socket at ffffffffc0601045 [cifs]
#15 [ffff9d0568603e58] cifs_demultiplex_thread at ffffffffc060155c [cifs]
#16 [ffff9d0568603ec8] kthread at ffffffff940c1c71
#17 [ffff9d0568603f50] ret_from_fork_nospec_begin at ffffffff94774c1d
int cifs_readv_from_socket(struct TCP_Server_Info *, struct kvec *, unsigned int, unsigned int);
3 RDI: 0xffff9d057675a000
7 RSI: 0xffff9d0568603e80
4 RDX: 0x1
5 RCX: 0x4
that's this connection:
3) (struct TCP_Server_Info *)0xffff9d057675a000
hostname: '10.1.2.3'
task: (struct task_struct *)0xffff9d0575a9a080 - pid: 7698 - cifsd
TCP Status: 0x3 (CifsNeedReconnect)
last response 20 jiffies (0 0:00:00.020)
Local users to server: 1 requests on wire: 0
sec_mode: 0x1 (MAY_SIGN)
CurrentMid: 0x002a1cc0
socket: (struct socket *)0xffff9d0547dd3480, (struct sock *)0xffff9d0050bb5540
src: (None, None), dst: ('10.2.3.4', 445)
request_q has 2 entries: list -H 0xffff9d057675a188
0xffff9d05707ba080 14776 grep UN 0 0:00:00.018
0xffff9d05446dc100 16978 kworker/u16:1 UN 0 0:00:00.018
response_q has 0 entries: list -H 0xffff9d057675a170
pending_mid_q has 0 entries: list -H 0xffff9d057675a198 -s mid_q_entry
1 smb session(s): list -H 0xffff9d057675a010 -o cifs_ses.smb_ses_list -s cifs_ses
1) (struct cifs_ses *)0xffff9d057be73c00
.serverName: '10.2.3.4', .serverDomain: 'None'
.serverOS: None
.serverNOS: None
.ses_count: 1, capabilities: 0x300036: CAP_MPX_MODE | CAP_UNICODE | CAP_NT_SMBS | CAP_RPC_REMOTE_APIS | 0x300000
.SMB session status: 1 (CifsGood), .need_reconnect: True
.auth_key.len: 16 - 1f5c7863 623f087f 1a5c7838 645c7864
.user_name: 'cifsuser', domainName: 'LOCAL_CLUSTER'
.tcon_ipc: (struct cifs_tcon *)0xffff9d057e253000 - tree: '\\10.2.3.4\IPC$', need_reconnect: True
.ipc: True, .pipe: True, .retry: False, .nocase: False, .seal: False, .unix_ext: False, .local_lease: False
.broken_posix_open: False, .broken_sparse_sup: False, .need_reopen_files: False, .use_resilient: False, .use_persistent: False
.capabilities: 0x0 -
.nativeFileSystem: 'None', .tc_count: 1
mount options: versdefault,sec=unknown?
open files: list -H 0xffff9d057e253028 -s cifsFileInfo
.rlist: list -H 0xffff9d057e253018 - []
.tcon_list: list -H 0xffff9d057be73c10 -s cifs_tcon
1) (struct cifs_tcon *)0xffff9d057e253800 - tree: '\\10.2.3.4\archive', need_reconnect: True
.ipc: False, .pipe: False, .retry: False, .nocase: False, .seal: False, .unix_ext: False, .local_lease: False
.broken_posix_open: False, .broken_sparse_sup: False, .need_reopen_files: False, .use_resilient: False, .use_persistent: True
.capabilities: 0x10 - SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY
.nativeFileSystem: 'None', .tc_count: 1
mount options: versdefault,sec=unknown?
open files: list -H 0xffff9d057e253828 -s cifsFileInfo
1) (struct cifsFileInfo *)0xffff9d057de1ce00
partial path: 'path/filename1.csv'
.dentry: 0xffff9d0547cc30c0
.uid: <kuid_t 0xffff9d057de1ce28>, .pid: 14776 'grep'
2) (struct cifsFileInfo *)0xffff9d0546f90a00
partial path: '/path/filename2.csv'
.dentry: 0xffff9d057c5b6e40
.uid: <kuid_t 0xffff9d0546f90a28>, .pid: 14776 'grep'
3) (struct cifsFileInfo *)0xffff9d0555f9c400
partial path: '/path/filename3'
.dentry: 0xffff9d0572bd23c0
.uid: <kuid_t 0xffff9d0555f9c428>, .pid: 13692 'grep'
.rlist: list -H 0xffff9d057e253818 - []
looking at the data on the stack:
int kernel_recvmsg(struct socket *, struct msghdr *, struct kvec *, size_t, size_t, int);
5 RDI: 0xffff9d0547dd3480
6 RSI: 0xffff9d0568603de8
1 RDX: 0xffff9d0576ff1810
4 RCX: 0x1
crash> socket.sk ffff9d0547dd3480
sk = 0xffff9d0050bb5540
crash> xportshow -tv
...
windows: rcv=29312, snd=14480 advmss=1448 rcv_ws=7 snd_ws=7
nonagle=0 sack_ok=3 tstamp_ok=1
rmem_alloc=0, wmem_alloc=1
rx_queue=0, tx_queue=0
rcvbuf=367360, sndbuf=87040
rcv_tstamp=0.02 s, lsndtime=0.02 s ago, RTO=201 ms
however, sk_wait_data was called with a different sock: ffff9d05747ef440
crash> tcp_sock.inet_conn ffff9d05747ef440 -ox
struct tcp_sock {
[ffff9d05747ef440] struct inet_connection_sock inet_conn;
crash> inet_connection_sock.icsk_inet ffff9d05747ef440 -ox
struct inet_connection_sock {
[ffff9d05747ef440] struct inet_sock icsk_inet;
crash> inet_sock.sk ffff9d05747ef440 -ox
struct inet_sock {
[ffff9d05747ef440] struct sock sk;
crash> struct sock.sk_socket ffff9d05747ef440
sk_socket = 0x0
the sock was saved from socket->sk in inet_recvmsg
int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
size_t size, int flags)
{
struct sock *sk = sock->sk;
the socket was obtained from TCP_Server_Info->ssocket in cifs_readv_from_socket:
int
cifs_readv_from_socket(struct TCP_Server_Info *server, struct kvec *iov_orig,
unsigned int nr_segs, unsigned int to_read)
{
...
for (total_read = 0; to_read; total_read += length, to_read -= length) {
try_to_freeze();
if (server_unresponsive(server)) {
total_read = -ECONNABORTED;
break;
}
segs = kvec_array_init(iov, iov_orig, nr_segs, total_read);
length = kernel_recvmsg(server->ssocket, &smb_msg, <<
iov, segs, to_read, 0); <<
int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
struct kvec *vec, size_t num, size_t size, int flags)
{
...
msg->msg_iov = (struct iovec *)vec, msg->msg_iovlen = num;
result = sock_recvmsg(sock, msg, size, flags); <<<<
int sock_recvmsg(struct socket *sock, struct msghdr *msg,
size_t size, int flags)
{
struct kiocb iocb;
struct sock_iocb siocb; <<<< sock_iocb
int ret;
init_sync_kiocb(&iocb, NULL);
iocb.private = &siocb;
ret = __sock_recvmsg(&iocb, sock, msg, size, flags); <<<<
static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size, int flags)
{
int err = security_socket_recvmsg(sock, msg, size, flags);
return err ?: __sock_recvmsg_nosec(iocb, sock, msg, size, flags); <<<<
static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size, int flags)
{
struct sock_iocb *si = kiocb_to_siocb(iocb);
si->sock = sock; <<<< socket stored
si->scm = NULL;
si->msg = msg;
si->size = size;
si->flags = flags;
return sock->ops->recvmsg(iocb, sock, msg, size, flags);
int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
size_t size, int flags)
{
struct sock *sk = sock->sk;
int addr_len = 0;
int err;
sock_rps_record_flow(sk);
err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t len, int nonblock, int flags, int *addr_len)
{
...
do {
...
if (copied >= target) {
/* Do not sleep, just process backlog. */
release_sock(sk);
lock_sock(sk);
} else {
sk_wait_data(sk, &timeo, last);
}
int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb)
{
int rc;
DEFINE_WAIT(wait);
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags); <<<<
sk->sk_socket is NULL
that 'struct sock' is:
<struct tcp_sock 0xffff9d05747ef440> TCP
tcp 10.1.2.3:38430 10.2.3.4:445 FIN_WAIT1
windows: rcv=3119360, snd=3145728 advmss=1448 rcv_ws=7 snd_ws=7
nonagle=0 sack_ok=7 tstamp_ok=1
rmem_alloc=0, wmem_alloc=1
rx_queue=0, tx_queue=1
rcvbuf=6291456, sndbuf=4194304
rcv_tstamp=0.03 s, lsndtime=0.04 s ago, RTO=203 ms
note that the new sock has just begun passing data:
the old tcp_sock:
<struct tcp_sock 0xffff9d05747ef440> TCP
tcp 10.1.2.3:38430 10.2.3.4:445 FIN_WAIT1
rcv_tstamp=0.03 s, lsndtime=0.04 s ago, RTO=203 ms
crash> tcp_sock.segs_in,bytes_received,segs_out,rcv_tstamp 0xffff9d05747ef440
segs_in = 0x9bec43
bytes_received = 0x1583b248c2
segs_out = 0x26dd4ae
rcv_tstamp = 0xae9d2f
crash> pd (jiffies & 0xffffffff - 0xae9d2f)
$13 = 64
64 msec
the new tcp_sock:
<struct tcp_sock 0xffff9d0050bb5540> TCP
tcp 10.1.2.3:49526 10.2.3.4:445 ESTABLISHED
rcv_tstamp=0.02 s, lsndtime=0.02 s ago, RTO=201 ms
crash> tcp_sock.segs_in,bytes_received,segs_out,rcv_tstamp 0xffff9d0050bb5540
segs_in = 0x1
bytes_received = 0x0
segs_out = 0x2
rcv_tstamp = 0xae9d39
crash> pd (jiffies & 0xffffffff - 0xae9d39)
$12 = 68
68 msec
==========================================================================================
vmcore 953557243
int cifs_readv_from_socket(struct TCP_Server_Info *, struct kvec *, unsigned int, unsigned int);
#14 cifs_readv_from_socket called from 0xffffffffc065255c <cifs_demultiplex_thread+284>
3 RDI: 0xffff9389a8e46000
7 RSI: 0xffff9398297d7e80
4 RDX: 0x1
5 RCX: 0x4
(again see cifs_info.out)
3) (struct TCP_Server_Info *)0xffff9389a8e46000
hostname: '10.2.3.4'
task: (struct task_struct *)0xffff9398306730c0 - pid: 11361 - cifsd
TCP Status: 0x4 (CifsNeedNegotiate)
last response 22 jiffies (0 0:00:00.022)
Local users to server: 1 requests on wire: 1
sec_mode: 0x1 (MAY_SIGN)
CurrentMid: 0x00000001
socket: (struct socket *)0xffff93983d149b80, (struct sock *)0xffff93987fd83e00
*** TODO: socket state
src: (None, None), dst: ('10.2.3.4', 445)
request_q has 2 entries: list -H 0xffff9389a8e46188
0xffff93952b7a5140 10203 perl UN 0 0:00:00.000
0xffff93981caa5140 82652 kworker/u256:3 UN 0 0:00:00.000
response_q has 1 entries: list -H 0xffff9389a8e46170
0xffff939634f8e180 9185 grep UN 0 0:00:00.020
pending_mid_q has 1 entries: list -H 0xffff9389a8e46198 -s mid_q_entry
0xffff9397b972ae00 mid: 0x00000000 MID_FREE | MID_REQUEST_SUBMITTED Create Directory
allocated 22 jiffies ago (0 0:00:00.022)
0xffff939634f8e180 9185 grep UN 0 0:00:00.020
1 smb session(s): list -H 0xffff9389a8e46010 -o cifs_ses.smb_ses_list -s cifs_ses
1) (struct cifs_ses *)0xffff939834efae00
.serverName: '10.2.3.4', .serverDomain: 'None'
.serverOS: None
.serverNOS: None
.ses_count: 1, capabilities: 0x300036: CAP_MPX_MODE | CAP_UNICODE | CAP_NT_SMBS | CAP_RPC_REMOTE_APIS | 0x300000
.SMB session status: 1 (CifsGood), .need_reconnect: True
.auth_key.len: 16 - 5c786339 260e5c78 64665c78 64645c78
.user_name: 'cifsuser', domainName: 'LOCAL_CLUSTER'
.tcon_ipc: (struct cifs_tcon *)0xffff9389a8e41800 - tree: '\\10.2.3.4\IPC$', need_reconnect: True
.ipc: True, .pipe: True, .retry: False, .nocase: False, .seal: False, .unix_ext: False, .local_lease: False
.broken_posix_open: False, .broken_sparse_sup: False, .need_reopen_files: False, .use_resilient: False, .use_persistent: False
.capabilities: 0x0 -
.nativeFileSystem: 'None', .tc_count: 1
mount options: vers3.0,sec=unknown?
open files: list -H 0xffff9389a8e41828 -s cifsFileInfo
.rlist: list -H 0xffff9389a8e41818 - []
.tcon_list: list -H 0xffff939834efae10 -s cifs_tcon
1) (struct cifs_tcon *)0xffff9389a8e40000 - tree: '\\10.2.3.4\archive', need_reconnect: True
.ipc: False, .pipe: False, .retry: False, .nocase: False, .seal: False, .unix_ext: False, .local_lease: False
.broken_posix_open: False, .broken_sparse_sup: False, .need_reopen_files: False, .use_resilient: False, .use_persistent: True
.capabilities: 0x10 - SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY
.nativeFileSystem: 'None', .tc_count: 1
mount options: vers3.0,sec=unknown?
open files: list -H 0xffff9389a8e40028 -s cifsFileInfo
1) (struct cifsFileInfo *)0xffff93982937b400
partial path: '/path/filename1.csv'
.dentry: 0xffff93983d284840
.uid: <kuid_t 0xffff93982937b428>, .pid: 10203 'perl'
2) (struct cifsFileInfo *)0xffff9396d2fa3200
partial path: '/path/filename2'
.dentry: 0xffff938fd25c2300
.uid: <kuid_t 0xffff9396d2fa3228>, .pid: 9185 'grep'
.rlist: list -H 0xffff9389a8e40018 - []
so:
socket: (struct socket *)0xffff93983d149b80, (struct sock *)0xffff93987fd83e00
crash> xportshow -tv
<struct tcp_sock 0xffff93987fd83e00> TCP
tcp 10.1.2.3:55964 10.2.3.4:445 ESTABLISHED
windows: rcv=30336, snd=14592 advmss=1448 rcv_ws=7 snd_ws=7
nonagle=0 sack_ok=3 tstamp_ok=1
rmem_alloc=2176, wmem_alloc=1
rx_queue=212, tx_queue=0
rcvbuf=367360, sndbuf=87040
rcv_tstamp=0.02 s, lsndtime=0.02 s ago, RTO=201 ms
and there's another sock:
<struct tcp_sock 0xffff9398242ebe00> TCP
tcp 10.1.2.3:48146 10.2.3.4:445 FIN_WAIT1
windows: rcv=3145728, snd=3145088 advmss=1448 rcv_ws=7 snd_ws=7
nonagle=0 sack_ok=7 tstamp_ok=1
rmem_alloc=0, wmem_alloc=1
rx_queue=0, tx_queue=1
rcvbuf=6291456, sndbuf=4194304
rcv_tstamp=0.02 s, lsndtime=0.02 s ago, RTO=201 ms
which is the one on which we crashed:
crash> sock.sk_socket 0xffff9398242ebe00
sk_socket = 0x0
so we should have been waiting on this sock:
<struct tcp_sock 0xffff93987fd83e00> TCP
tcp 10.1.2.3:55964 10.2.3.4:445 ESTABLISHED
but instead were waiting on:
<struct tcp_sock 0xffff9398242ebe00> TCP
tcp 10.1.2.3:48146 10.2.3.4:445 FIN_WAIT1
for the valid sock:
crash> search -t 0xffff93987fd83e00 | grep PID | awk '{print $2}' | paste -s - -d ' '
9185
crash> ps -m 9185
[0 00:00:00.021] [UN] PID: 9185 TASK: ffff939634f8e180 CPU: 8 COMMAND: "grep"
crash> bt -f 9185 | egrep -e ffff93987fd83e00 -e PID | egrep -e ffff93987fd83e00 -B1 | grep PID
(so this address is found only in pid 9185, and only in the remnant of its stack; the process has only just run, and is in the process of reconnecting:)
PID: 9185 TASK: ffff939634f8e180 CPU: 8 COMMAND: "grep"
#0 [ffff9396cbb436b8] __schedule at ffffffff9eb67747
#1 [ffff9396cbb43748] schedule at ffffffff9eb67c49
#2 [ffff9396cbb43758] wait_for_response at ffffffffc066c253 [cifs]
#3 [ffff9396cbb437b0] cifs_send_recv at ffffffffc066d3e7 [cifs]
#4 [ffff9396cbb43800] smb2_send_recv at ffffffffc066d9b1 [cifs]
#5 [ffff9396cbb43928] SMB2_negotiate at ffffffffc068245d [cifs]
#6 [ffff9396cbb439b8] smb2_negotiate at ffffffffc067a75d [cifs]
#7 [ffff9396cbb439c8] cifs_negotiate_protocol at ffffffffc0654f13 [cifs]
#8 [ffff9396cbb439f0] smb2_reconnect at ffffffffc0681e06 [cifs]
#9 [ffff9396cbb43a80] smb2_plain_req_init at ffffffffc068209e [cifs]
#10 [ffff9396cbb43ac0] smb2_new_read_req.constprop.11 at ffffffffc0683732 [cifs]
#11 [ffff9396cbb43af0] SMB2_read at ffffffffc0685886 [cifs]
#12 [ffff9396cbb43b80] smb2_sync_read at ffffffffc067a4fd [cifs]
#13 [ffff9396cbb43b90] cifs_readpage_worker at ffffffffc065bf76 [cifs]
#14 [ffff9396cbb43c48] cifs_readpage at ffffffffc065c665 [cifs]
#15 [ffff9396cbb43c80] generic_file_aio_read at ffffffff9e5b81f0
#16 [ffff9396cbb43d58] cifs_strict_readv at ffffffffc0662a69 [cifs]
#17 [ffff9396cbb43da0] do_sync_read at ffffffff9e640a83
#18 [ffff9396cbb43e78] vfs_read at ffffffff9e6414bf
#19 [ffff9396cbb43ea8] cb_sys_write at ffffffffc03d4c8b [cbsensor]
#20 [ffff9396cbb43f50] system_call_fastpath at ffffffff9eb74ddb
again, a very new sock:
old tcp_sock:
<struct tcp_sock 0xffff9398242ebe00> TCP
tcp 10.1.2.3:48146 10.2.3.4:445 FIN_WAIT1
crash> tcp_sock.segs_in,bytes_received,segs_out,rcv_tstamp 0xffff9398242ebe00
segs_in = 0xc885fa
bytes_received = 0x17b0ff6548
segs_out = 0x2eeff5f
rcv_tstamp = 0x1a76240
crash> pd (jiffies & 0xffffffff - 0x1a76240)
$20 = 23
23 msec
new tcp_sock:
<struct tcp_sock 0xffff93987fd83e00> TCP
tcp 10.1.2.3:55964 10.2.3.4:445 ESTABLISHED
crash> tcp_sock.segs_in,bytes_received,segs_out,rcv_tstamp 0xffff93987fd83e00
segs_in = 0x3
bytes_received = 0xd4
segs_out = 0x4
rcv_tstamp = 0x1a76241
crash> pd (jiffies & 0xffffffff - 0x1a76241)
$22 = 22
22 msec