Bug 781439

Summary: NFS race between retransmit execution and request unmapping on response arrival
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.7
Reporter: Laszlo Ersek <lersek>
Assignee: Jeff Layton <jlayton>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: jlayton, nfs-maint, nhorman, pbonzini, rwheeler, steved
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-11-20 14:17:00 UTC

Description Laszlo Ersek 2012-01-13 12:23:13 UTC
From <http://marc.info/?l=linux-nfs&m=122424132729720&w=2>:

------------------------------------v------------------------------------
For some time now I have been tracking down a mysterious host crash when
using Xen with blktap (userspace paravirtual disk implementation) under
stress conditions. I have now managed to create a simple test case with
no Xen components which appears to show RPC retransmissions of write
requests containing invalid data. I'll start at the beginning and
explain the original bug although the test case is much simpler.

blktap is a userspace daemon which provides paravirtualised disk support
to Xen guests. The pages of data for a block write are mapped from the
Xen guest into a domain 0 driver called blktap which has a userspace
component which implements qcow/vhd/etc and writes the data to the
backing file using O_DIRECT and aio in a zero-copy manner. Once the
aio is completed the pages are returned to the guest and unmapped from
domain 0. When a page is unmapped in this way the pte is set not present
and the PFN is mapped to an invalid MFN.
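
For illustration, the dom0-side unmap amounts to roughly the following
(a sketch against the mainline Xen grant-table API; the helper name is
made up, and the real 2.6.18 blktap code and header paths differ):

/* Hypothetical helper sketching what "unmapped from domain 0" means.
 * Header locations and exact signatures vary between the 2.6.18 Xen
 * tree and mainline; this follows the mainline grant-table API. */
#include <linux/kernel.h>
#include <xen/grant_table.h>
#include <asm/xen/hypercall.h>

static void unmap_guest_page(unsigned long vaddr, grant_handle_t handle)
{
	struct gnttab_unmap_grant_ref unmap;

	gnttab_set_unmap_op(&unmap, vaddr, GNTMAP_host_map, handle);
	if (HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, &unmap, 1))
		BUG();
	/* From here on the pte is not present and the PFN maps to an
	 * invalid MFN: any dom0 access to the page (e.g. a NIC driver
	 * touching an skb fragment) faults. */
}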

We have been seeing a crash in the domain 0 network driver's start_xmit
routine when it attempts to access data from a page where the PFN maps
to an invalid MFN. I added some tracing to the kernel and observed this
sequence of events on an individual page:

    tap:  0/41 at 4fac7a165ad6 "rpc_init_task" c8792844 0
    tap:  1/41 at 4fac7a1663e2 "nfs_direct_write_schedule" ece19850 2000
    tap:  2/41 at 4fac7a166cca "call_start" c8792844 0
    tap:  3/41 at 4fac7a167540 "call_reserve" c8792844 0
    tap:  4/41 at 4fac7a167de6 "call_reserveresult" c8792844 0
    tap:  5/41 at 4fac7a168620 "call_allocate" c8792844 0
    tap:  6/41 at 4fac7a168f08 "call_bind" c8792844 0
    tap:  7/41 at 4fac7a169712 "call_connect" c8792844 0
    tap:  8/41 at 4fac7a169f28 "call_transmit" c8792844 0
    tap:  9/41 at 4fac7a16a7f2 "call_encode" c8792844 0
    tap: 10/41 at 4fac7a16afd8 "call_header" c8792844 0
    tap: 11/41 at 4fac7a16bc6e "xs_tcp_send_request" c8792844 0
    tap: 12/41 at 4fac7a16c9d0 "tcp_sendpage" 0 0
    tap: 13/41 at 4fac7a16cec2 "do_tcp_sendpages (adding data to skb)"
                               cef08b00 0
    tap: 14/41 at 4fac7a16e068 "call_transmit_status" c8792844 0
    tap: 15/41 at 4fac7a2ed8f4 "tcp_transmit_skb, skb_clone" c9dca500
                               c9dca5a8
    tap: 16/41 at 4faeeeb9f566 "xprt_timer" c8792844 0
    tap: 17/41 at 4faeeeb9ff0e "xprt_timer: !req->rq_received" c8792844 0
    tap: 18/41 at 4faeeeba08ec "rpc_make_runnable" c8792844 0
    tap: 19/41 at 4faeeec117b8 "call_status" c8792844 ffffff92
    tap: 20/41 at 4faeeec11faa "timeout (minor)" c8792844 0
    tap: 21/41 at 4faeeec12778 "call_bind" c8792844 0
    tap: 22/41 at 4faeeec12ef8 "call_connect" c8792844 0
    tap: 23/41 at 4faeeec13678 "call_transmit" c8792844 0
    tap: 24/41 at 4faeeec13e46 "call_encode" c8792844 0
    tap: 25/41 at 4faeeec145ae "call_header" c8792844 0
    tap: 26/41 at 4faeeec15082 "xs_tcp_send_request" c8792844 0
    tap: 27/41 at 4faeeec15d1e "tcp_sendpage" 0 0
    tap: 28/41 at 4faeeec161ce "do_tcp_sendpages (adding data to skb)"
                               d06afe40 0
    tap: 29/41 at 4faeeec172ea "call_transmit_status" c8792844 0
    tap: 30/41 at 4faf2e3280d4 "rpc_make_runnable" c8792844 0
    tap: 31/41 at 4faf2e3440c4 "call_status" c8792844 88
    tap: 32/41 at 4faf2e3449d6 "call_decode" c8792844 0
    tap: 33/41 at 4faf2e345240 "call_verify" c8792844 0
    tap: 34/41 at 4faf2e345a9e "rpc_exit_task" c8792844 0
    tap: 35/41 at 4faf2e34652a "nfs_direct_write_result" ece19850 2000
    tap: 36/41 at 4faf2e351000 "nfs_direct_write_release (completing)"
                               ece19850 0
    tap: 37/41 at 4faf2e3517ec "nfs_direct_write_complete,DEFAULT w/ iocb"
                               ece19850 0
    tap: 38/41 at 4faf2e35205c "nfs_direct_free_writedata" ece19850 0
    tap: 39/41 at 4faf2e51026a "fast_flush_area" 0 0
    tap: 40/41 at 4faf33e1813a "tcp_transmit_skb, skb_clone" d06afe40
                               d06afee8

(nb: fast_flush_area is the blktap function which returns the pages to
the guest and unmaps them from domain 0; it is called via ioctl from the
userspace process once the aio write returns successfully. 4fac7.... is
the tsc; the processor is 2.33GHz.)

So what we see is the initial request being constructed and transmitted
(around 11/41-15/41), followed by a timeout ~60s later (16/41-20/41),
which causes us to queue a retransmit (26/41-29/41) but, critically, not
yet actually transmit it. At this point we get the reply to the original
request and complete the NFS write (35/41-38/41), returning success to
userspace, which causes it to unmap the pages and return them to the
guest (39/41). Finally (40/41) we attempt to transmit the duplicate
request and crash because the pages are not present any more.
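
The same interleaving, condensed into a timeline (entry numbers refer
to the trace above):

    retransmit path                        reply path
    ---------------                        ----------
    xprt_timer             (16/41)
    call_bind..call_header (21-25/41)
    xs_tcp_send_request    (26/41)
    do_tcp_sendpages       (28/41)  <- retransmit queued on the socket;
                                       the skb references the data
                                       pages, nothing on the wire yet
                                           call_status..rpc_exit_task
                                                         (31-34/41)
                                           nfs_direct_write_result/
                                           _release      (35-38/41)
                                           -> success returned to
                                              userspace
                                           fast_flush_area (39/41)
                                           -> pages unmapped from dom0
    tcp_transmit_skb       (40/41)  <- queued retransmit finally goes
                                       out, referencing pages that are
                                       no longer present -> crash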

By using libnetfilter_queue I was then able to reproduce this in
non-stress situations by introducing delays into the network. Stalling
all network traffic with the NFS server for 65s every 1000 packets seems
to do the trick (probably on the aggressive side; I didn't attempt to
minimise the interruption required to reproduce, and I realise that this
represents a pretty crappy network and/or server).
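
A minimal sketch of such a delay injector (a reconstruction, not the
tool actually used; it assumes an iptables rule such as "iptables -A
OUTPUT -d $NFS_SERVER -j NFQUEUE --queue-num 0" is already in place,
and it builds with -lnetfilter_queue):

/* Hold back all queued traffic for 65s every 1000 packets.  Sleeping
 * in the callback stalls the whole NFQUEUE, so every packet behind us
 * waits too -- which is the point. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/netfilter.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

static unsigned long npackets;

static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
              struct nfq_data *nfa, void *data)
{
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    uint32_t id = ph ? ntohl(ph->packet_id) : 0;

    if (++npackets % 1000 == 0)
        sleep(65);      /* longer than the ~60s RPC timeout seen above */

    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    struct nfq_handle *h = nfq_open();
    struct nfq_q_handle *qh;
    char buf[4096];
    int fd, rv;

    if (!h)
        return 1;
    nfq_unbind_pf(h, AF_INET);
    nfq_bind_pf(h, AF_INET);
    qh = nfq_create_queue(h, 0, &cb, NULL);
    if (!qh)
        return 1;
    nfq_set_mode(qh, NFQNL_COPY_META, 0);

    fd = nfq_fd(h);
    while ((rv = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, rv);

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}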

Given the observed sequence of events I then constructed a fairly simple
test program (blktest2.c, attached) using O_DIRECT (but not aio; I don't
think it matters) which shows the same issue without involving any Xen
parts. The test writes a buffer filled with a counter to a file and,
immediately after the write() returns, fills the buffer with 0xdeadbeef.
By using tcpdump I capture and observe duplicated requests on the wire
containing 0xdeadbeef in the payload rather than the expected counter
values. I usually see this in a matter of minutes. I've attached a pcap
of a single request/reply pair which was corrupted.
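
A minimal sketch in the spirit of the described test (a reconstruction;
the real blktest2.c is attached to the bug, and the file name, size and
looping details here are illustrative):

/* Write a counter-filled buffer with O_DIRECT, then scribble
 * 0xdeadbeef over it the moment write() returns.  If a retransmission
 * still references the buffer, tcpdump will show 0xdeadbeef on the
 * wire instead of the counter. */
#define _GNU_SOURCE             /* O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFSZ 4096              /* one page, O_DIRECT-aligned */

int main(void)
{
    uint32_t *buf, counter = 0;
    size_t i;
    int fd = open("/mnt/nfs/blktest2.dat",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

    if (fd < 0 || posix_memalign((void **)&buf, BUFSZ, BUFSZ))
        return 1;

    for (;;) {
        for (i = 0; i < BUFSZ / sizeof(*buf); i++)
            buf[i] = counter;
        counter++;

        if (write(fd, buf, BUFSZ) != (ssize_t)BUFSZ)
            return 1;

        /* write() has returned, so per POSIX the kernel must be done
         * with the buffer and reusing it is legal. */
        for (i = 0; i < BUFSZ / sizeof(*buf); i++)
            buf[i] = 0xdeadbeef;

        if (lseek(fd, 0, SEEK_SET) < 0)
            return 1;
    }
}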

Presumably, in the case of a decent NFS server, the duplicate request
cache (keyed by XID) would prevent the bogus data from actually reaching
the disk, but on a less robust server I suspect it might actually lead
to corruption (AIUI the request cache is not a hard requirement of the
NFS protocol?). Perhaps even a decent server might have timed out the
entry in the cache after such a delay?

The Xen case and the blktest2 repro was on 2.6.18. I have also
reproduced the blktest2 case on 2.6.27 native but not with Xen since no
domain 0 support exists just yet.

I can think of several possible options to fix this issue:
      * Do not return success to userspace until all duplicated requests
        have actually hit the wire, even if the response comes back
        earlier than that.
      * Cancel queued duplicates if a response comes in late.
      * Copy the data pages on retransmission.

I think I prefer the 3rd option, since the first two are tricky: the
request has already been merged into a TCP stream and may have been
fragmented/segmented etc. I don't think copying on retransmit has a
massive performance impact -- you must already be suffering from pretty
severe network or server issues!
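
Sketched against the sunrpc data structures, the 3rd option would
amount to something like the following in the transmit path (a
conceptual sketch, not the posted patches; the helper is hypothetical,
and a real version would have to free the copies when the request
completes and cope with partial sends):

/* Before retransmitting, swap the caller's pages out of the request's
 * send buffer for private copies, so the caller's pages can be
 * unmapped/reused as soon as the reply arrives. */
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/sunrpc/xprt.h>

static int xprt_copy_pages_for_retransmit(struct rpc_rqst *req)
{
	struct xdr_buf *buf = &req->rq_snd_buf;
	unsigned int i, npages;

	npages = (buf->page_base + buf->page_len + PAGE_SIZE - 1)
		 >> PAGE_SHIFT;
	for (i = 0; i < npages; i++) {
		struct page *copy = alloc_page(GFP_NOFS);

		if (!copy)
			return -ENOMEM;
		copy_highpage(copy, buf->pages[i]);
		/* A real version must remember these copies and free
		 * them when the request completes. */
		buf->pages[i] = copy;
	}
	return 0;
}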

I have CC'd the maintainer of the bnx2 Ethernet driver (Michael Chan)
because I have so far only been able to reproduce this with the Broadcom
NIC. Even on the same server, if I switch to e1000 I cannot reproduce
it. However, given the analysis above, I'm not convinced it is likely to
be a driver bug.
------------------------------------^------------------------------------

I believe this problem is in the background of bug 734188.

Patches have been posted upstream to fix this. The most recent series seems to be here: <http://thread.gmane.org/gmane.linux.network/217006>. The direct fix is 6/6, which is based on the first five patches in the series.

Comment 1 Laszlo Ersek 2012-01-13 13:02:36 UTC
Opening up, then.

Comment 2 Jeff Layton 2012-03-22 14:23:48 UTC
(cc'ing Neil in case he's interested)

Comment 7 RHEL Program Management 2013-11-20 14:17:00 UTC
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.