Bug 1271759

Summary:

[nfs] writing to a udp6 mount results in a lockup

Product:

Red Hat Enterprise Linux 7

Reporter:

Artem Savkov <asavkov>

Component:

kernel

Assignee:

Benjamin Coddington <bcodding>

kernel sub component:

NFS

QA Contact:

JianHong Yin <jiyin>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

unspecified

CC:

aquini, bcodding, eguan, hsowa, jbrouer, jburke, jiji, jiyin, jstancek, kzhang, lmiksik, nfs-maint, pbunyan, sdubroca, steved, swhiteho, vyasevic

Version:

7.2

Keywords:

Regression

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-3.10.0-326.el7

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-11-19 23:22:13 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
rhel67 write capture	none
rhel72 write capture	none

Description Artem Savkov 2015-10-14 15:34:16 UTC

Description of problem:
Trying to write to an nfs3 mount mounted with proto=udp6 option results in a hung task with the following call traces in dmesg:

[  722.635750] INFO: task test5:17820 blocked for more than 120 seconds. 
[  722.642204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  722.650041] test5           D ffff88040958bd40 11808 17820  17811 0x00000080 
[  722.657140]  ffff88040958bbc0 0000000000000046 ffff88040958bfd8 ffff88040958bfd8 
[  722.664616]  ffff88040958bfd8 ffff88040958bfd8 ffff880822228000 ffff880822ad4f40 
[  722.672072]  ffff88082bbd5600 0000000000000000 7fffffffffffffff ffffffff811aac70 
[  722.679557] Call Trace: 
[  722.682026]  [<ffffffff811aac70>] ? wait_on_page_read+0x60/0x60 
[  722.687954]  [<ffffffff81707c49>] schedule+0x29/0x70 
[  722.692929]  [<ffffffff817036f1>] schedule_timeout+0x1c1/0x360 
[  722.698773]  [<ffffffff81025905>] ? native_sched_clock+0x35/0x80 
[  722.704785]  [<ffffffff81025959>] ? sched_clock+0x9/0x10 
[  722.710103]  [<ffffffff8170a2a6>] ? _raw_spin_unlock_irqrestore+0x36/0x70 
[  722.716895]  [<ffffffff810253e9>] ? read_tsc+0x9/0x10 
[  722.721955]  [<ffffffff810ff2cc>] ? ktime_get_ts64+0x4c/0xf0 
[  722.727620]  [<ffffffff811aac70>] ? wait_on_page_read+0x60/0x60 
[  722.733544]  [<ffffffff8170716e>] io_schedule_timeout+0xae/0x130 
[  722.739556]  [<ffffffff81707208>] io_schedule+0x18/0x1a 
[  722.744790]  [<ffffffff811aac7e>] sleep_on_page+0xe/0x20 
[  722.750110]  [<ffffffff81703950>] __wait_on_bit+0x60/0x90 
[  722.755518]  [<ffffffff811acf2a>] ? find_get_pages_tag+0x21a/0x360 
[  722.761703]  [<ffffffff811aa9e6>] wait_on_page_bit+0x86/0xb0 
[  722.767369]  [<ffffffff810c10d0>] ? wake_atomic_t_function+0x40/0x40 
[  722.773731]  [<ffffffff811aab21>] filemap_fdatawait_range+0x111/0x1d0 
[  722.780179]  [<ffffffff811ad982>] filemap_write_and_wait_range+0x42/0x70 
[  722.786903]  [<ffffffffa05dac20>] nfs_file_fsync+0x80/0x2b0 [nfs] 
[  722.793007]  [<ffffffff812695bb>] vfs_fsync+0x2b/0x40 
[  722.798072]  [<ffffffffa05d98ba>] nfs_file_flush+0x7a/0xb0 [nfs] 
[  722.804086]  [<ffffffff8122f267>] filp_close+0x37/0x90 
[  722.809232]  [<ffffffff812544ea>] __close_fd+0x8a/0x140 
[  722.814466]  [<ffffffff812309f3>] SyS_close+0x23/0x50 
[  722.819525]  [<ffffffff817149c9>] system_call_fastpath+0x16/0x1b 
[  722.825537] no locks held by test5/17820. 


Version-Release number of selected component (if applicable):
kernel-3.10.0-323.el7

How reproducible:
100%

Steps to Reproduce:
1. mkdir nfs
2. mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs
3. dd if=/dev/urandom of=nfs/testfile bs=1297 count=1

Actual results:
task hangs while closing the file

Expected results:
task exits normally

Additional info:

1297 bytes is a magic number after which the problem shows up, copying over 1296(and less) works fine.
Data should be written in a quick succession, so dding with bs=1297 count=1 and bs=1 count=1297 from /dev/urandom or /dev/zero will result in a hang, but dding 8000 bytes from /dev/random does not.
Same test on a system with 7.1z kernel (e.g. kernel-3.10.0-229.16.1.el7) works without an issue.

[ 9210.221124] Showing all locks held in the system: 
[ 9210.227377] 2 locks held by agetty/1035: 
[ 9210.231287]  #0:  (&tty->ldisc_sem){++++++}, at: [<ffffffff814318f3>] tty_ldisc_ref_wait+0x23/0x50 
[ 9210.240275]  #1:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8142e8ad>] n_tty_read+0x5fd/0x940 
[ 9210.249522] 2 locks held by agetty/1036: 
[ 9210.253431]  #0:  (&tty->ldisc_sem){++++++}, at: [<ffffffff814318f3>] tty_ldisc_ref_wait+0x23/0x50 
[ 9210.262417]  #1:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8142e8ad>] n_tty_read+0x5fd/0x940 
[ 9210.271682] 3 locks held by 20_sysinfo/34403: 
[ 9210.276024]  #0:  (sb_writers#4){.+.+.+}, at: [<ffffffff812317bb>] vfs_write+0x1bb/0x1f0 
[ 9210.284153]  #1:  (sysrq_key_table_lock){......}, at: [<ffffffff81436404>] __handle_sysrq+0x24/0x170 
[ 9210.293311]  #2:  (tasklist_lock){.+.+..}, at: [<ffffffff8110ca22>] debug_show_all_locks+0x52/0x2c0

Comment 4 Jan Stancek 2015-10-15 10:22:17 UTC

3.10.0-240.el7 is first bad kernel.

Comment 5 Benjamin Coddington 2015-10-15 12:17:46 UTC

Reproduced..

Comment 6 Jan Stancek 2015-10-15 12:24:33 UTC

(In reply to Jan Stancek from comment #4)
> 3.10.0-240.el7 is first bad kernel.

It appears to be related to partial UDP checksums introduced in this kernel:

BAD:
a29633df44be [net] ipv6: Partial checksum only UDP packets
7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets

GOOD:
4eab0ea762e9 [net] udpv6: Add lockless sendmsg() support

I could trigger it on upstream 4.3.0-rc5+ as well.

Comment 8 Artem Savkov 2015-10-15 14:17:26 UTC

(In reply to Steve Dickson from comment #7)
> > BAD:
> > a29633df44be [net] ipv6: Partial checksum only UDP packets
> > 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets
> Should we back these out??

I can confirm that reverting those 2 (along with "acdf399 - [net] ipv6: Fix udp checksums with raw sockets") does resolve the issue.

Comment 9 Jianwen Ji 2015-10-15 16:24:44 UTC

(In reply to Artem Savkov from comment #8)
> (In reply to Steve Dickson from comment #7)
> > > BAD:
> > > a29633df44be [net] ipv6: Partial checksum only UDP packets
> > > 7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets
> > Should we back these out??
> 
> I can confirm that reverting those 2 (along with "acdf399 - [net] ipv6: Fix
> udp checksums with raw sockets") does resolve the issue.

Yes, I revert above two commits on kernel-3.10.0-324.el7, it works well. Maybe we need to dig out the deeper reason.

Comment 10 Benjamin Coddington 2015-10-15 19:05:01 UTC

Client is sending 3 frags (they look right):

 70  94.851257   fc00::3:72 -> fc00::3:73   IPv6 1510 IPv6 fragment (nxt=UDP (17) off=0 id=0xeb0c4d34)
 71  94.851361   fc00::3:72 -> fc00::3:73   IPv6 1510 IPv6 fragment (nxt=UDP (17) off=181 id=0xeb0c4d34)
 72  94.851381   fc00::3:72 -> fc00::3:73   NFS 1322 [RPC retransmission of #3]V3 WRITE Call, FH: 0xafe34795 Offset: 0 Len: 4000 FILE_SYNC

Server is using kernel_recvmsg() and getting -EAGAIN back each time these frames are transmitted:

net/sunrpc/svcsock.c:
571     skb = NULL;
572     err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
573                  0, 0, MSG_PEEK | MSG_DONTWAIT);
574     if (err >= 0)
575         skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err);
576
577     if (skb == NULL) {
578         if (err != -EAGAIN) {
579             /* possibly an icmp error */
580             dprintk("svc: recvfrom returned error %d\n", -err);
581             set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
582         }
583         return 0;
584     }

Traced this back into the server's udpv6_recvmsg().. in the failing case it appears we are jumping to csum_copy_err in this section:
net/ipv6/udp.c:
432     if (copied < ulen || UDP_SKB_CB(skb)->partial_cov) {
433         if (udp_lib_checksum_complete(skb))
434             goto csum_copy_err;
435     }

Comment 12 JianHong Yin 2015-10-16 02:14:29 UTC

I had a look at the patch in 7c921d059440:

+       /* If this is the first and only packet and device
+        * supports checksum offloading, let's use it.
+        */
+       if (!skb &&
+           length + fragheaderlen < mtu &&
+           rt->dst.dev->features & NETIF_F_V6_CSUM &&
+           !exthdrlen)
+               csummode = CHECKSUM_PARTIAL;

I guess maybe (not sure):
  "length + fragheaderlen < mtu"    should be
  "length + fragheaderlen <= mtu"

Comment 14 Benjamin Coddington 2015-10-16 13:55:00 UTC

Created attachment 1083670 [details]
rhel67 write capture

Comment 15 Benjamin Coddington 2015-10-16 13:55:38 UTC

Created attachment 1083671 [details]
rhel72 write capture

Comment 16 Benjamin Coddington 2015-10-16 14:02:13 UTC

Two attachments:

 attachment 1083670 [details] contains 3 udp frames for an NFS write from a RHEL6.7 NFS client.  This WRITE causes the server to respond appropriately

 attachment 1083671 [details] contains 3 udp frames for an NFS write from a RHEL7.2 NFS client that has
BAD:
a29633df44be [net] ipv6: Partial checksum only UDP packets
7c921d059440 [net] ipv6: Allow for partial checksums on non-ufo packets

This WRITE causes the server to never respond.. the server seems to be behaving as in comment 10.  An upstream server behaves the same way.

Something is different in these two captures that is causing the server to not respond, but I am not able to find it!  The captures were recorded from within the VM on the server side..

..hmm maybe I need to capture externally to the server.  I'll try that next.

Comment 20 Benjamin Coddington 2015-10-16 15:18:52 UTC

A capture external to my VMs shows that in the failing case the WRITE tx UDP checksum is incorrect.

On the RHEL7.2 client:
[root@redhat-72 ~]# ethtool -i eno16777736
driver: e1000
version: 7.3.21-k8-NAPI
firmware-version:
bus-info: 0000:02:01.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[root@redhat-72 ~]# ethtool -k eno16777736 |grep tx-check
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]


As soon as I run `ethtool -K eno16777736 tx off` the client's IO completes properly and the problem is not reproduced.

I'm suspecting that the combination of udp6 fragments and checksum offload is the key here..

Comment 25 Benjamin Coddington 2015-10-20 13:56:14 UTC

net/ipv6/ip6_output.c:
1256     /* If this is the first and only packet and device
1257      * supports checksum offloading, let's use it.
1258      * Use transhdrlen, same as IPv4, because partial
1259      * sums only work when transhdrlen is set.
1260      */
1261     if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
1262         length + fragheaderlen < mtu &&
1263         rt->dst.dev->features & NETIF_F_V6_CSUM &&
1264         !exthdrlen)
1265         csummode = CHECKSUM_PARTIAL;

Just before this check, a probe shows that the length var is short enough to be under mtu, so we set ip_summed to CHECKSUM_PARTIAL.  Looks like this is because xs_sendpages() first uses kernel_sendmsg() to send along NFS protocol headers, then decides to use sock_sendpage() or sock_no_sendpage() to send along the page data.

kworker/0:1H   656 [000] 78128.764564: probe:udpv6_sendmsg: (ffffffff815f2a40) len=88
                  7f2a41 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                    6f59 xs_send_kvec ([sunrpc])
                    70c7 xs_sendpages ([sunrpc])
                    72a9 xs_udp_send_request ([sunrpc])
                    5586 xprt_transmit ([sunrpc])
                    1868 call_transmit ([sunrpc])
                    bbf4 __rpc_execute ([sunrpc])
                    bfa6 rpc_async_schedule ([sunrpc])
                  29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)

kworker/0:1H   656 [000] 78128.764605: probe:__ip6_append_data_2: (ffffffff815d7dd7) transhdrlen=8 length=144 fragheaderlen=28
                  7d7dd8 __ip6_append_data.isra.32 (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7d866d ip6_append_data (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7f2cc5 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7a0e24 inet_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                    6f59 xs_send_kvec ([sunrpc])
                    70c7 xs_sendpages ([sunrpc])
                    72a9 xs_udp_send_request ([sunrpc])
                    5586 xprt_transmit ([sunrpc])
                    1868 call_transmit ([sunrpc])
                    bbf4 __rpc_execute ([sunrpc])
                    bfa6 rpc_async_schedule ([sunrpc])
                  29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)

kworker/0:1H   656 [000] 78128.764615: probe:udpv6_sendmsg: (ffffffff815f2a40) len=fa0
                  7f2a41 udpv6_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c00 sock_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  710c77 kernel_sendmsg (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  7130f9 sock_no_sendpage (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                    71c6 xs_sendpages ([sunrpc])
                    72a9 xs_udp_send_request ([sunrpc])
                    5586 xprt_transmit ([sunrpc])
                    1868 call_transmit ([sunrpc])
                    bbf4 __rpc_execute ([sunrpc])
                    bfa6 rpc_async_schedule ([sunrpc])
                  29d5fb process_one_work (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  29e3cb worker_thread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  2a5aef kthread (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)
                  845718 ret_from_fork (/usr/lib/debug/lib/modules/3.10.0-324.el7.x86_64/vmlinux)

So the first pass through __ip6_append_data() gets us the ip_summed == CHECKSUM_PARTIAL, and then the additional data becomes large enough to fragment.

Comment 32 Vlad Yasevich 2015-10-20 17:20:50 UTC

(In reply to Benjamin Coddington from comment #25) 
> So the first pass through __ip6_append_data() gets us the ip_summed ==
> CHECKSUM_PARTIAL, and then the additional data becomes large enough to
> fragment.

So, that's ok.  When udp_v6_push_pending_frames() is called, we'll end up
calling udp_v6_send_skb() which should correct the checksum in the above case
(see udp6_hwcsum_outgoing).  We should be dropping the partial checksum and
re-calculating the software checksum...

I am trying to duplicate this issue with unit tests and I think it might depend
on the hw....  I am going to try on a few different nics...

I've acked the fix Hannes proposed, but I am not sure if that's the real issue ATM.

-vlad

Comment 33 Rafael Aquini 2015-10-26 11:44:46 UTC

Patch(es) available on kernel-3.10.0-326.el7

Comment 36 JianHong Yin 2015-10-27 16:15:29 UTC

Verified:

[root@hp-dl388g8-18 ~]# mkdir nfs
[root@hp-dl388g8-18 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs
mount.nfs: Failed to resolve server rhel6-nfs: Name or service not known
[root@hp-dl388g8-18 ~]# hostname
hp-dl388g8-18.rhts.eng.pek2.redhat.com
[root@hp-dl388g8-18 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs.rhts.eng.bos.redhat.com:/export/home nfs
[root@hp-dl388g8-18 ~]# LANG=C dd if=/dev/urandom of=nfs/testfile bs=1297 count=1
1+0 records in
1+0 records out
1297 bytes (1.3 kB) copied, 0.293311 s, 4.4 kB/s
[root@hp-dl388g8-18 ~]# uname -r
3.10.0-326.el7.x86_64

reproduced:

      Beaker Test information:
                         HOSTNAME=hp-dl388g8-15.rhts.eng.pek2.redhat.com
                            JOBID=1127329
                         RECIPEID=2293791
                    RESULT_SERVER=[::1]:7096
                           DISTRO=RHEL-7.2-20151008.0
                     ARCHITECTURE=x86_64

      Job Whiteboard: RHEL-7 ss4

      Recipe Whiteboard: 
**  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **
[root@hp-dl388g8-15 ~]# 
[root@hp-dl388g8-15 ~]# 
[root@hp-dl388g8-15 ~]# mkdir nfs
[root@hp-dl388g8-15 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs:/export/home nfs^C
[root@hp-dl388g8-15 ~]# mount -t nfs -o nfsvers=3,proto=udp6 rhel6-nfs.rhts.eng.bos.redhat.com:/export/home nfs
[root@hp-dl388g8-15 ~]# LANG=C date
Wed Oct 28 00:09:46 CST 2015
[root@hp-dl388g8-15 ~]# LANG=C dd if=/dev/urandom of=nfs/testfile bs=1297 count=1

^C^C^C^C^C^C   <<<--- hangup, and could not be killed by Ctr+C

Comment 40 errata-xmlrpc 2015-11-19 23:22:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2152.html