Bug 1271759
Summary: [nfs] writing to a udp6 mount results in a lockup
Product: Red Hat Enterprise Linux 7
Component: kernel
kernel sub component: NFS
Version: 7.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: Regression
Target Milestone: rc
Reporter: Artem Savkov <asavkov>
Assignee: Benjamin Coddington <bcodding>
QA Contact: JianHong Yin <jiyin>
CC: aquini, bcodding, eguan, hsowa, jbrouer, jburke, jiji, jiyin, jstancek, kzhang, lmiksik, nfs-maint, pbunyan, sdubroca, steved, swhiteho, vyasevic
Fixed In Version: kernel-3.10.0-326.el7
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-11-19 23:22:13 UTC

Description
Artem Savkov
2015-10-14 15:34:16 UTC
3.10.0-240.el7 is the first bad kernel.

Reproduced.

(In reply to Jan Stancek from comment #4)
> 3.10.0-240.el7 is first bad kernel.

It appears to be related to partial UDP checksums introduced in this kernel:

BAD:
a29633df44be [net] ipv6: Partial checksum only UDP packets
7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets

GOOD:
4eab0ea762e9 [net] udpv6: Add lockless sendmsg() support

I could trigger it on upstream 4.3.0-rc5+ as well.

(In reply to Steve Dickson from comment #7)
> > BAD:
> > a29633df44be [net] ipv6: Partial checksum only UDP packets
> > 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets
> Should we back these out??

I can confirm that reverting those 2 (along with "acdf399 - [net] ipv6: Fix udp checksums with raw sockets") does resolve the issue.

(In reply to Artem Savkov from comment #8)
> (In reply to Steve Dickson from comment #7)
> > > BAD:
> > > a29633df44be [net] ipv6: Partial checksum only UDP packets
> > > 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets
> > Should we back these out??
>
> I can confirm that reverting those 2 (along with "acdf399 - [net] ipv6: Fix
> udp checksums with raw sockets") does resolve the issue.

Yes, I reverted the above two commits on kernel-3.10.0-324.el7 and it works well. Maybe we need to dig into the deeper reason.

Client is sending 3 frags (they look right):

70 94.851257 fc00::3:72 -> fc00::3:73 IPv6 1510 IPv6 fragment (nxt=UDP (17) off=0 id=0xeb0c4d34)
71 94.851361 fc00::3:72 -> fc00::3:73 IPv6 1510 IPv6 fragment (nxt=UDP (17) off=181 id=0xeb0c4d34)
72 94.851381 fc00::3:72 -> fc00::3:73 NFS 1322 [RPC retransmission of #3]V3 WRITE Call, FH: 0xafe34795 Offset: 0 Len: 4000 FILE_SYNC

Server is using kernel_recvmsg() and getting -EAGAIN back each time these frames are transmitted:

net/sunrpc/svcsock.c:
        skb = NULL;
        err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
                             0, 0, MSG_PEEK | MSG_DONTWAIT);
        if (err >= 0)
                skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err);

        if (skb == NULL) {
                if (err != -EAGAIN) {
                        /* possibly an icmp error */
                        dprintk("svc: recvfrom returned error %d\n", -err);
                        set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
                }
                return 0;
        }

Traced this back into the server's udpv6_recvmsg(). In the failing case it appears we are jumping to csum_copy_err in this section:

net/ipv6/udp.c:
        if (copied < ulen || UDP_SKB_CB(skb)->partial_cov) {
                if (udp_lib_checksum_complete(skb))
                        goto csum_copy_err;
        }

I had a look at the patch in 7c921d059440:

+       /* If this is the first and only packet and device
+        * supports checksum offloading, let's use it.
+        */
+       if (!skb &&
+           length + fragheaderlen < mtu &&
+           rt->dst.dev->features & NETIF_F_V6_CSUM &&
+           !exthdrlen)
+               csummode = CHECKSUM_PARTIAL;

I guess maybe (not sure): "length + fragheaderlen < mtu" should be "length + fragheaderlen <= mtu"

Created attachment 1083670 [details]
rhel67 write capture
Created attachment 1083671 [details]
rhel72 write capture
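The server-side failure at udp_lib_checksum_complete() comes down to ones-complement arithmetic over the IPv6 pseudo-header plus the UDP datagram. The following is a minimal userland sketch of that arithmetic, not kernel code. The function names (udp6_checksum, udp6_partial_seed, udp6_checksum_ok) are hypothetical and chosen only for illustration. It shows that a datagram whose checksum field still holds only a pseudo-header sum, roughly what a sender leaves behind when it expects the NIC to finish the checksum, does not verify at the receiver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running ones-complement sum, read as big-endian 16-bit words. */
static uint32_t sum_words(uint32_t sum, const uint8_t *p, size_t len)
{
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                      /* odd trailing byte is padded with zero */
        sum += (uint32_t)p[0] << 8;
    return sum;
}

/* Fold the carries of a 32-bit accumulator back into 16 bits. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* IPv6 pseudo-header: source, destination, upper-layer length, next header 17 (UDP). */
static uint32_t pseudo_sum(const uint8_t src[16], const uint8_t dst[16], uint32_t udp_len)
{
    uint32_t sum = sum_words(0, src, 16);
    sum = sum_words(sum, dst, 16);
    sum += (udp_len >> 16) + (udp_len & 0xffff);
    sum += 17;
    return sum;
}

/* Folded pseudo-header-only sum: a stand-in for an unfinished "partial" checksum. */
uint16_t udp6_partial_seed(const uint8_t src[16], const uint8_t dst[16], size_t udp_len)
{
    return fold(pseudo_sum(src, dst, (uint32_t)udp_len));
}

/* Full checksum for a datagram whose checksum field (bytes 6..7) is zero. */
uint16_t udp6_checksum(const uint8_t src[16], const uint8_t dst[16],
                       const uint8_t *udp, size_t udp_len)
{
    assert(udp_len >= 8 && udp_len <= 65535);
    uint32_t sum = sum_words(pseudo_sum(src, dst, (uint32_t)udp_len), udp, udp_len);
    uint16_t csum = (uint16_t)~fold(sum);
    return csum ? csum : 0xffff;  /* zero is transmitted as 0xffff */
}

/* Receiver side: summing everything, checksum field included, must fold to 0xffff. */
int udp6_checksum_ok(const uint8_t src[16], const uint8_t dst[16],
                     const uint8_t *udp, size_t udp_len)
{
    uint32_t sum = sum_words(pseudo_sum(src, dst, (uint32_t)udp_len), udp, udp_len);
    return fold(sum) == 0xffff;
}
```

A caller would zero bytes 6 and 7 of the UDP header, store the result of udp6_checksum() there in network byte order, and expect udp6_checksum_ok() to return 1. Storing the value from udp6_partial_seed() instead makes the same check return 0, which is the kind of datagram the receiving socket discards.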
Two attachments:

attachment 1083670 [details] contains 3 udp frames for an NFS write from a RHEL6.7 NFS client. This WRITE causes the server to respond appropriately.

attachment 1083671 [details] contains 3 udp frames for an NFS write from a RHEL7.2 NFS client that has

BAD:
a29633df44be [net] ipv6: Partial checksum only UDP packets
7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets

This WRITE causes the server to never respond. The server seems to be behaving as in comment 10. An upstream server behaves the same way.

Something is different in these two captures that is causing the server to not respond, but I am not able to find it! The captures were recorded from within the VM on the server side. Hmm, maybe I need to capture externally to the server. I'll try that next.

A capture external to my VMs shows that in the failing case the WRITE tx UDP checksum is incorrect. On the RHEL7.2 client:

[root@redhat-72 ~]# ethtool -i eno16777736
driver: e1000
version: 7.3.21-k8-NAPI
firmware-version:
bus-info: 0000:02:01.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
[root@redhat-72 ~]# ethtool -k eno16777736 |grep tx-check
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]

As soon as I run `ethtool -K eno16777736 tx off` the client's IO completes properly and the problem is not reproduced.

I'm suspecting that the combination of udp6 fragments and checksum offload is the key here.

net/ipv6/ip6_output.c:
        /* If this is the first and only packet and device
         * supports checksum offloading, let's use it.
         * Use transhdrlen, same as IPv4, because partial
         * sums only work when transhdrlen is set.
         */
        if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
            length + fragheaderlen < mtu &&
            rt->dst.dev->features & NETIF_F_V6_CSUM &&
            !exthdrlen)
                csummode = CHECKSUM_PARTIAL;

Just before this check, a probe shows that the length var is short enough to be under mtu, so we set ip_summed to CHECKSUM_PARTIAL. Looks like this is because xs_sendpages() first uses kernel_sendmsg() to send along NFS protocol headers, then decides to use sock_sendpage() or sock_no_sendpage() to send along the page data.

kworker/0:1H 656 [000] 78128.764564: probe:udpv6_sendmsg: (ffffffff815f2a40) len=88
        7f2a41 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        6f59 xs_send_kvec ([sunrpc])
        70c7 xs_sendpages ([sunrpc])
        72a9 xs_udp_send_request ([sunrpc])
        5586 xprt_transmit ([sunrpc])
        1868 call_transmit ([sunrpc])
        bbf4 __rpc_execute ([sunrpc])
        bfa6 rpc_async_schedule ([sunrpc])
        29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)

kworker/0:1H 656 [000] 78128.764605: probe:__ip6_append_data_2: (ffffffff815d7dd7) transhdrlen=8 length=144 fragheaderlen=28
        7d7dd8 __ip6_append_data.isra.32 (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        7d866d ip6_append_data (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        7f2cc5 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        7a0e24 inet_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        6f59 xs_send_kvec ([sunrpc])
        70c7 xs_sendpages ([sunrpc])
        72a9 xs_udp_send_request ([sunrpc])
        5586 xprt_transmit ([sunrpc])
        1868 call_transmit ([sunrpc])
        bbf4 __rpc_execute ([sunrpc])
        bfa6 rpc_async_schedule ([sunrpc])
        29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)

kworker/0:1H 656 [000] 78128.764615: probe:udpv6_sendmsg: (ffffffff815f2a40) len=fa0
        7f2a41 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        7130f9 sock_no_sendpage (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        71c6 xs_sendpages ([sunrpc])
        72a9 xs_udp_send_request ([sunrpc])
        5586 xprt_transmit ([sunrpc])
        1868 call_transmit ([sunrpc])
        bbf4 __rpc_execute ([sunrpc])
        bfa6 rpc_async_schedule ([sunrpc])
        29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
        845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)

So the first pass through __ip6_append_data() gets us the ip_summed == CHECKSUM_PARTIAL, and then the additional data becomes large enough to fragment.

(In reply to Benjamin Coddington from comment #25)
> So the first pass through __ip6_append_data() gets us the ip_summed ==
> CHECKSUM_PARTIAL, and then the additional data becomes large enough to
> fragment.

So, that's ok.
When udp_v6_push_pending_frames() is called, we'll end up calling udp_v6_send_skb(), which should correct the checksum in the above case (see udp6_hwcsum_outgoing). We should be dropping the partial checksum and re-calculating the software checksum.

I am trying to duplicate this issue with unit tests and I think it might depend on the hw. I am going to try on a few different nics.

I've acked the fix Hannes proposed, but I am not sure if that's the real issue ATM.

-vlad

Patch(es) available on kernel-3.10.0-326.el7

Verified:

[root@hp-dl388g8-18 ~]# mkdir nfs
[root@hp-dl388g8-18 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs
mount.nfs: Failed to resolve server rhel6-nfs: Name or service not known
[root@hp-dl388g8-18 ~]# hostname
hp-dl388g8-18.rhts.eng.pek2.redhat.com
[root@hp-dl388g8-18 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs.rhts.eng.bos.redhat.com:/export/home nfs
[root@hp-dl388g8-18 ~]# LANG=C dd if=/dev/urandom of=nfs/testfile bs=1297 count=1
1+0 records in
1+0 records out
1297 bytes (1.3 kB) copied, 0.293311 s, 4.4 kB/s
[root@hp-dl388g8-18 ~]# uname -r
3.10.0-326.el7.x86_64

Reproduced:

Beaker Test information:
HOSTNAME=hp-dl388g8-15.rhts.eng.pek2.redhat.com
JOBID=1127329
RECIPEID=2293791
RESULT_SERVER=[::1]:7096
DISTRO=RHEL-7.2-20151008.0
ARCHITECTURE=x86_64
Job Whiteboard: RHEL-7 ss4
Recipe Whiteboard:

[root@hp-dl388g8-15 ~]# mkdir nfs
[root@hp-dl388g8-15 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs^C
[root@hp-dl388g8-15 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs.rhts.eng.bos.redhat.com:/export/home nfs
[root@hp-dl388g8-15 ~]# LANG=C date
Wed Oct 28 00:09:46 CST 2015
[root@hp-dl388g8-15 ~]# LANG=C dd if=/dev/urandom of=nfs/testfile bs=1297 count=1
^C^C^C^C^C^C <<<--- hangs, and could not be killed with Ctrl+C

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2152.html
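
The checksum-mode decision analysed in this thread can be modeled outside the kernel. The sketch below is a userland illustration only: the struct, enum and function names are hypothetical, and it mirrors just the condition quoted from net/ipv6/ip6_output.c. The length of 144 comes from the probe output in this thread. The header length of 40 and the MTU of 1500 are assumptions. The point it illustrates is that the test sees only the length of the first append, so a later append of page data can still push the datagram past the MTU after CHECKSUM_PARTIAL was already chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's CHECKSUM_NONE / CHECKSUM_PARTIAL. */
enum csum_mode { CSUM_NONE, CSUM_PARTIAL };

/* Inputs of the quoted condition, reduced to plain values. */
struct append_ctx {
    bool have_transhdr;          /* transhdrlen != 0: first chunk of the datagram */
    bool is_udp;                 /* sk->sk_protocol == IPPROTO_UDP */
    bool dev_v6_csum;            /* device advertises NETIF_F_V6_CSUM */
    bool has_exthdr;             /* exthdrlen != 0 */
    unsigned int fragheaderlen;  /* assumed 40: a bare IPv6 header */
    unsigned int mtu;            /* assumed 1500 */
};

/* Mirrors the quoted test; length is what this single append call was given. */
enum csum_mode choose_csum_mode(const struct append_ctx *c, unsigned int length)
{
    assert(c != NULL);
    if (c->have_transhdr && c->is_udp &&
        length + c->fragheaderlen < c->mtu &&
        c->dev_v6_csum && !c->has_exthdr)
        return CSUM_PARTIAL;
    return CSUM_NONE;
}

/* True when the accumulated datagram no longer fits into one packet. */
bool needs_fragmentation(const struct append_ctx *c, unsigned int total_len)
{
    assert(c != NULL);
    return total_len + c->fragheaderlen > c->mtu;
}
```

With those assumed values, choose_csum_mode() returns CSUM_PARTIAL for the 144-byte header append, while needs_fragmentation() is true once the 4000 bytes of page data are added. In that combination the sender has to fall back to a software checksum before transmitting.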