Bug 1271759
| Summary: | [nfs] writing to a udp6 mount results in a lockup | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Artem Savkov <asavkov> | ||||||
| Component: | kernel | Assignee: | Benjamin Coddington <bcodding> | ||||||
| kernel sub component: | NFS | QA Contact: | JianHong Yin <jiyin> | ||||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||||
| Severity: | high | ||||||||
| Priority: | unspecified | CC: | aquini, bcodding, eguan, hsowa, jbrouer, jburke, jiji, jiyin, jstancek, kzhang, lmiksik, nfs-maint, pbunyan, sdubroca, steved, swhiteho, vyasevic | ||||||
| Version: | 7.2 | Keywords: | Regression | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | kernel-3.10.0-326.el7 | Doc Type: | Bug Fix | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2015-11-19 23:22:13 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: | 
 | ||||||||
| 3.10.0-240.el7 is first bad kernel. Reproduced.. (In reply to Jan Stancek from comment #4) > 3.10.0-240.el7 is first bad kernel. It appears to be related to partial UDP checksums introduced in this kernel: BAD: a29633df44be [net] ipv6: Partial checksum only UDP packets 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets GOOD: 4eab0ea762e9 [net] udpv6: Add lockless sendmsg() support I could trigger it on upstream 4.3.0-rc5+ as well. (In reply to Steve Dickson from comment #7) > > BAD: > > a29633df44be [net] ipv6: Partial checksum only UDP packets > > 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets > Should we back these out?? I can confirm that reverting those 2 (along with "acdf399 - [net] ipv6: Fix udp checksums with raw sockets") does resolve the issue. (In reply to Artem Savkov from comment #8) > (In reply to Steve Dickson from comment #7) > > > BAD: > > > a29633df44be [net] ipv6: Partial checksum only UDP packets > > > 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets > > Should we back these out?? > > I can confirm that reverting those 2 (along with "acdf399 - [net] ipv6: Fix > udp checksums with raw sockets") does resolve the issue. Yes, I revert above two commits on kernel-3.10.0-324.el7, it works well. Maybe we need to dig out the deeper reason. Client is sending 3 frags (they look right):
 70  94.851257   fc00::3:72 -> fc00::3:73   IPv6 1510 IPv6 fragment (nxt=UDP (17) off=0 id=0xeb0c4d34)
 71  94.851361   fc00::3:72 -> fc00::3:73   IPv6 1510 IPv6 fragment (nxt=UDP (17) off=181 id=0xeb0c4d34)
 72  94.851381   fc00::3:72 -> fc00::3:73   NFS 1322 [RPC retransmission of #3]V3 WRITE Call, FH: 0xafe34795 Offset: 0 Len: 4000 FILE_SYNC
Server is using kernel_recvmsg() and getting -EAGAIN back each time these frames are transmitted:
net/sunrpc/svcsock.c:
571     skb = NULL;
572     err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
573                  0, 0, MSG_PEEK | MSG_DONTWAIT);
574     if (err >= 0)
575         skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err);
576
577     if (skb == NULL) {
578         if (err != -EAGAIN) {
579             /* possibly an icmp error */
580             dprintk("svc: recvfrom returned error %d\n", -err);
581             set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
582         }
583         return 0;
584     }
Traced this back into the server's udpv6_recvmsg().. in the failing case it appears we are jumping to csum_copy_err in this section:
net/ipv6/udp.c:
432     if (copied < ulen || UDP_SKB_CB(skb)->partial_cov) {
433         if (udp_lib_checksum_complete(skb))
434             goto csum_copy_err;
435     }
I had a look at the patch in 7c921d059440: + /* If this is the first and only packet and device + * supports checksum offloading, let's use it. + */ + if (!skb && + length + fragheaderlen < mtu && + rt->dst.dev->features & NETIF_F_V6_CSUM && + !exthdrlen) + csummode = CHECKSUM_PARTIAL; I guess maybe (not sure): "length + fragheaderlen < mtu" should be "length + fragheaderlen <= mtu" Created attachment 1083670 [details]
rhel67 write capture
Created attachment 1083671 [details]
rhel72 write capture
Two attachments: attachment 1083670 [details] contains 3 udp frames for an NFS write from a RHEL6.7 NFS client. This WRITE causes the server to respond appropriately attachment 1083671 [details] contains 3 udp frames for an NFS write from a RHEL7.2 NFS client that has BAD: a29633df44be [net] ipv6: Partial checksum only UDP packets 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets This WRITE causes the server to never respond.. the server seems to be behaving as in comment 10. An upstream server behaves the same way. Something is different in these two captures that is causing the server to not respond, but I am not able to find it! The captures were recorded from within the VM on the server side.. ..hmm maybe I need to capture externally to the server. I'll try that next. A capture external to my VMs shows that in the failing case the WRITE tx UDP checksum is incorrect. On the RHEL7.2 client: [root@redhat-72 ~]# ethtool -i eno16777736 driver: e1000 version: 7.3.21-k8-NAPI firmware-version: bus-info: 0000:02:01.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: no [root@redhat-72 ~]# ethtool -k eno16777736 |grep tx-check tx-checksumming: on tx-checksum-ipv4: off [fixed] tx-checksum-ip-generic: on tx-checksum-ipv6: off [fixed] tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: off [fixed] As soon as I run `ethtool -K eno16777736 tx off` the client's IO completes properly and the problem is not reproduced. I'm suspecting that the combination of udp6 fragments and checksum offload is the key here.. 
net/ipv6/ip6_output.c:
1256     /* If this is the first and only packet and device
1257      * supports checksum offloading, let's use it.
1258      * Use transhdrlen, same as IPv4, because partial
1259      * sums only work when transhdrlen is set.
1260      */
1261     if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
1262         length + fragheaderlen < mtu &&
1263         rt->dst.dev->features & NETIF_F_V6_CSUM &&
1264         !exthdrlen)
1265         csummode = CHECKSUM_PARTIAL;
Just before this check, a probe shows that the length var is short enough to be under mtu, so we set ip_summed to CHECKSUM_PARTIAL.  Looks like this is because xs_sendpages() first uses kernel_sendmsg() to send along NFS protocol headers, then decides to use sock_sendpage() or sock_no_sendpage() to send along the page data.
kworker/0:1H   656 [000] 78128.764564: probe:udpv6_sendmsg: (ffffffff815f2a40) len=88
                  7f2a41 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                    6f59 xs_send_kvec ([sunrpc])
                    70c7 xs_sendpages ([sunrpc])
                    72a9 xs_udp_send_request ([sunrpc])
                    5586 xprt_transmit ([sunrpc])
                    1868 call_transmit ([sunrpc])
                    bbf4 __rpc_execute ([sunrpc])
                    bfa6 rpc_async_schedule ([sunrpc])
                  29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
kworker/0:1H   656 [000] 78128.764605: probe:__ip6_append_data_2: (ffffffff815d7dd7) transhdrlen=8 length=144 fragheaderlen=28
                  7d7dd8 __ip6_append_data.isra.32 (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7d866d ip6_append_data (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7f2cc5 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7a0e24 inet_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                    6f59 xs_send_kvec ([sunrpc])
                    70c7 xs_sendpages ([sunrpc])
                    72a9 xs_udp_send_request ([sunrpc])
                    5586 xprt_transmit ([sunrpc])
                    1868 call_transmit ([sunrpc])
                    bbf4 __rpc_execute ([sunrpc])
                    bfa6 rpc_async_schedule ([sunrpc])
                  29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
kworker/0:1H   656 [000] 78128.764615: probe:udpv6_sendmsg: (ffffffff815f2a40) len=fa0
                  7f2a41 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7130f9 sock_no_sendpage (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                    71c6 xs_sendpages ([sunrpc])
                    72a9 xs_udp_send_request ([sunrpc])
                    5586 xprt_transmit ([sunrpc])
                    1868 call_transmit ([sunrpc])
                    bbf4 __rpc_execute ([sunrpc])
                    bfa6 rpc_async_schedule ([sunrpc])
                  29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
So the first pass through __ip6_append_data() gets us the ip_summed == CHECKSUM_PARTIAL, and then the additional data becomes large enough to fragment.
(In reply to Benjamin Coddington from comment #25) > So the first pass through __ip6_append_data() gets us the ip_summed == > CHECKSUM_PARTIAL, and then the additional data becomes large enough to > fragment. So, that's ok. When udp_v6_push_pending_frames() is called, we'll end up calling udp_v6_send_skb() which should correct the checksum in the above case (see udp6_hwcsum_outgoing). We should be dropping the partial checksum and re-calculating the software checksum... I am trying to duplicate this issue with unit tests and I think it might depend on the hw.... I am going to try on a few different nics... I've acked the fix Hannes proposed, but I am not sure if that's the real issue ATM. -vlad Patch(es) available on kernel-3.10.0-326.el7 Verified:
[root@hp-dl388g8-18 ~]# mkdir nfs
[root@hp-dl388g8-18 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs
mount.nfs: Failed to resolve server rhel6-nfs: Name or service not known
[root@hp-dl388g8-18 ~]# hostname
hp-dl388g8-18.rhts.eng.pek2.redhat.com
[root@hp-dl388g8-18 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs.rhts.eng.bos.redhat.com:/export/home nfs
[root@hp-dl388g8-18 ~]# LANG=C dd if=/dev/urandom of=nfs/testfile bs=1297 count=1
1+0 records in
1+0 records out
1297 bytes (1.3 kB) copied, 0.293311 s, 4.4 kB/s
[root@hp-dl388g8-18 ~]# uname -r
3.10.0-326.el7.x86_64
reproduced:
      Beaker Test information:
                         HOSTNAME=hp-dl388g8-15.rhts.eng.pek2.redhat.com
                            JOBID=1127329
                         RECIPEID=2293791
                    RESULT_SERVER=[::1]:7096
                           DISTRO=RHEL-7.2-20151008.0
                     ARCHITECTURE=x86_64
      Job Whiteboard: RHEL-7 ss4
      Recipe Whiteboard: 
**  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **
[root@hp-dl388g8-15 ~]# 
[root@hp-dl388g8-15 ~]# 
[root@hp-dl388g8-15 ~]# mkdir nfs
[root@hp-dl388g8-15 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs^C
[root@hp-dl388g8-15 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs.rhts.eng.bos.redhat.com:/export/home nfs
[root@hp-dl388g8-15 ~]# LANG=C date
Wed Oct 28 00:09:46 CST 2015
[root@hp-dl388g8-15 ~]# LANG=C dd if=/dev/urandom of=nfs/testfile bs=1297 count=1
^C^C^C^C^C^C   <<<--- hangup, and could not be killed by Ctr+C
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-2152.html | 
Description of problem: Trying to write to an nfs3 mount mounted with proto=udp6 option results in a hung task with the following call traces in dmesg: [ 722.635750] INFO: task test5:17820 blocked for more than 120 seconds. [ 722.642204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 722.650041] test5 D ffff88040958bd40 11808 17820 17811 0x00000080 [ 722.657140] ffff88040958bbc0 0000000000000046 ffff88040958bfd8 ffff88040958bfd8 [ 722.664616] ffff88040958bfd8 ffff88040958bfd8 ffff880822228000 ffff880822ad4f40 [ 722.672072] ffff88082bbd5600 0000000000000000 7fffffffffffffff ffffffff811aac70 [ 722.679557] Call Trace: [ 722.682026] [<ffffffff811aac70>] ? wait_on_page_read+0x60/0x60 [ 722.687954] [<ffffffff81707c49>] schedule+0x29/0x70 [ 722.692929] [<ffffffff817036f1>] schedule_timeout+0x1c1/0x360 [ 722.698773] [<ffffffff81025905>] ? native_sched_clock+0x35/0x80 [ 722.704785] [<ffffffff81025959>] ? sched_clock+0x9/0x10 [ 722.710103] [<ffffffff8170a2a6>] ? _raw_spin_unlock_irqrestore+0x36/0x70 [ 722.716895] [<ffffffff810253e9>] ? read_tsc+0x9/0x10 [ 722.721955] [<ffffffff810ff2cc>] ? ktime_get_ts64+0x4c/0xf0 [ 722.727620] [<ffffffff811aac70>] ? wait_on_page_read+0x60/0x60 [ 722.733544] [<ffffffff8170716e>] io_schedule_timeout+0xae/0x130 [ 722.739556] [<ffffffff81707208>] io_schedule+0x18/0x1a [ 722.744790] [<ffffffff811aac7e>] sleep_on_page+0xe/0x20 [ 722.750110] [<ffffffff81703950>] __wait_on_bit+0x60/0x90 [ 722.755518] [<ffffffff811acf2a>] ? find_get_pages_tag+0x21a/0x360 [ 722.761703] [<ffffffff811aa9e6>] wait_on_page_bit+0x86/0xb0 [ 722.767369] [<ffffffff810c10d0>] ? wake_atomic_t_function+0x40/0x40 [ 722.773731] [<ffffffff811aab21>] filemap_fdatawait_range+0x111/0x1d0 [ 722.780179] [<ffffffff811ad982>] filemap_write_and_wait_range+0x42/0x70 [ 722.786903] [<ffffffffa05dac20>] nfs_file_fsync+0x80/0x2b0 [nfs] [ 722.793007] [<ffffffff812695bb>] vfs_fsync+0x2b/0x40 [ 722.798072] [<ffffffffa05d98ba>] nfs_file_flush+0x7a/0xb0 [nfs] [ 722.804086] [<ffffffff8122f267>] filp_close+0x37/0x90 [ 722.809232] [<ffffffff812544ea>] __close_fd+0x8a/0x140 [ 722.814466] [<ffffffff812309f3>] SyS_close+0x23/0x50 [ 722.819525] [<ffffffff817149c9>] system_call_fastpath+0x16/0x1b [ 722.825537] no locks held by test5/17820. Version-Release number of selected component (if applicable): kernel-3.10.0-323.el7 How reproducible: 100% Steps to Reproduce: 1. mkdir nfs 2. mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs 3. dd if=/dev/urandom of=nfs/testfile bs=1297 count=1 Actual results: task hangs while closing the file Expected results: task exits normally Additional info: 1297 bytes is a magic number after which the problem shows up, copying over 1296(and less) works fine. Data should be written in a quick succession, so dding with bs=1297 count=1 and bs=1 count=1297 from /dev/urandom or /dev/zero will result in a hang, but dding 8000 bytes from /dev/random does not. Same test on a system with 7.1z kernel (e.g. kernel-3.10.0-229.16.1.el7) works without an issue. [ 9210.221124] Showing all locks held in the system: [ 9210.227377] 2 locks held by agetty/1035: [ 9210.231287] #0: (&tty->ldisc_sem){++++++}, at: [<ffffffff814318f3>] tty_ldisc_ref_wait+0x23/0x50 [ 9210.240275] #1: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8142e8ad>] n_tty_read+0x5fd/0x940 [ 9210.249522] 2 locks held by agetty/1036: [ 9210.253431] #0: (&tty->ldisc_sem){++++++}, at: [<ffffffff814318f3>] tty_ldisc_ref_wait+0x23/0x50 [ 9210.262417] #1: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8142e8ad>] n_tty_read+0x5fd/0x940 [ 9210.271682] 3 locks held by 20_sysinfo/34403: [ 9210.276024] #0: (sb_writers#4){.+.+.+}, at: [<ffffffff812317bb>] vfs_write+0x1bb/0x1f0 [ 9210.284153] #1: (sysrq_key_table_lock){......}, at: [<ffffffff81436404>] __handle_sysrq+0x24/0x170 [ 9210.293311] #2: (tasklist_lock){.+.+..}, at: [<ffffffff8110ca22>] debug_show_all_locks+0x52/0x2c0