Bug 470267 - cifs: data corruption due to interleaved partial writes timing out
cifs: data corruption due to interleaved partial writes timing out
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
5.2
powerpc Linux
urgent Severity urgent
: rc
: ---
Assigned To: Jeff Layton
Martin Jenner
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-11-06 09:36 EST by Tomas Smetana
Modified: 2011-01-24 18:29 EST (History)
13 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 14:38:30 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Proposed patch (12.73 KB, patch)
2008-11-06 09:39 EST, Tomas Smetana
no flags Details | Diff
Same as the previous one, but changed 3 seconds from 7 seconds in two places in connect.c in ipv6_connect (12.73 KB, patch)
2008-11-06 09:40 EST, Tomas Smetana
no flags Details | Diff
backport of all 3 patches (16.50 KB, patch)
2008-11-06 10:14 EST, Jeff Layton
no flags Details | Diff
backport of all 3 patches -- fix missing sk_rcvtimeo set too (16.50 KB, patch)
2008-11-06 13:46 EST, Jeff Layton
no flags Details | Diff
patch -- prevent cifs_writepages from skipping pages (1.68 KB, patch)
2008-12-17 06:47 EST, Jeff Layton
no flags Details | Diff

  None (edit)
Description Tomas Smetana 2008-11-06 09:36:53 EST
Firmware level PHYP, FSP, Etc: JS22 = 01EA330_031_031
Machine Type: JS22 (POWER)
OS: RHEL 5.2

Steps to recreate the problem:

Setup is JS22 running VIOS 815B with a partition running i5/OS and another partition running RHEL 5.2.

The RHEL 5.2 partition is using CIFS mounts to write data through the MPx test tool to multiple SMB
servers on other blades running various OSes.

While writing the data to the shares, MPx shows input/output errors and these messages show up in
the /var/log/messages
(note that the CIFS module code has been modified, these might not be standard messages)

kernel: CIFS VFS: smb_send2 on sock c00000003bb1bc80 stuck for 15 seconds
kernel: CIFS VFS: Error -11 sending data on socket to server
kernel: CIFS VFS: Write2 ret -11, wrote 0

There was a miscomapre, and this time we have a tcpdump output.
Unfortunately, lot of packets were dropped, but nonetheless
will look at the output once I am able to access the file.

b165s1:/tcp # tcpdump -s 256 -i eth4 -w DataMisCompare-trial01-part1
tcpdump: listening on eth4, link-type EN10MB (Ethernet), capture size 256 bytes
5431443 packets captured
20861671 packets received by filter
9998785 packets dropped by kernel

b165s1:/usr/src/linux # grep "00000000 " /var/log/messages | more
b165s1:/usr/src/linux # grep "11111111 " /var/log/messages | more
b165s1:/usr/src/linux # grep "6db66db6 " /var/log/messages | more
Jul  1 23:59:50 b165s1 kernel:  6db66db6 6db66db6 m ¶ m ¶ m ¶ m ¶
Jul  1 23:59:50 b165s1 kernel:  6db66db6 6db66db6 m ¶ m ¶ m ¶ m ¶
Jul  1 23:59:50 b165s1 kernel:  6db66db6 6db66db6 m ¶ m ¶ m ¶ m ¶
Jul  1 23:59:50 b165s1 kernel:  6db66db6 6db66db6 m ¶ m ¶ m ¶ m ¶
Jul  1 23:59:50 b165s1 kernel:  6db66db6 6db66db6 m ¶ m ¶ m ¶ m ¶

The cifs printfs do not show any pages written with 0s.
Comment 2 Tomas Smetana 2008-11-06 09:40:57 EST
Created attachment 322730 [details]
Same as the previous one, but changed 3 seconds from 7 seconds in two places in connect.c in ipv6_connect
Comment 3 Jeff Layton 2008-11-06 10:02:53 EST
So what's the reason for the divergence from upstream here? Why should we be using a 7s instead of a 3s one?
Comment 4 Jeff Layton 2008-11-06 10:06:04 EST
Looks like upstream does this? (ugh, hate that double-negative)

        (*csocket)->sk->sk_rcvtimeo = 7 * HZ;
        if (!noblocksnd)
                (*csocket)->sk->sk_sndtimeo = 3 * HZ;
Comment 5 Jeff Layton 2008-11-06 10:14:30 EST
Created attachment 322739 [details]
backport of all 3 patches

The timeouts don't look right, in ipv6_connect on that last patch. It looks like we're not actually setting sndtimeo in the normal case (noblocksnd unset).

I think this one looks like it does the right thing...
Comment 6 Tomas Smetana 2008-11-06 10:19:38 EST
Changing flags to "blocker?" since this bug may lead to data corruption.
Comment 7 Jeff Layton 2008-11-06 13:46:35 EST
Created attachment 322772 [details]
backport of all 3 patches -- fix missing sk_rcvtimeo set too

My last patch removed the set of sk_rcvtimeo in ipv6_connect. This one should be right
Comment 9 Jeff Layton 2008-11-08 11:17:09 EST
Email from Shirish about a reproducer for this. I haven't attempted his method yet, but we may be able to speed this along by combining the test below with a netem qdisc filter to simulate heavy packet loss:

----------------[snip]--------------

Anyone who needs to recreate this without using IBM test tools can do so 
with
the following setup:

Two systems (one client, one server)
On the server create 6 or more shares (the more threads the quicker the 
repro)
From the client mount each of the shares using CIFS

Create the following script named "copy" and run it against each mount 
point on
the client:

     #!/bin/bash
     dd if=$1 of=$2/$1 bs=1k 2>/dev/null
     while `diff $1 $2/$1 >dev/null`
     do
     dd if=$1 of=$2/$1 bs=1k 2>/dev/null
     done

I used a file named source.txt that was about 1.5M (any file will do) with
mounts named sharen - shares - so in six terminals:

     ./copy source.txt /mnt/sharen 
     ./copy source.txt /mnt/shareo
     etc. etc. for all six mounts

start a "tail -f /var/log/messages" in another windows and after a couple 
of
hours you should see:

"CIFS VFS: No writeable handles for inode"


This will be accompanied by file truncation on one of the sessions, even
though if you remove the >/dev/null on the dd inside the while loop you
will note that dd thinks it successfully wrote the data to the drive.

We are still working on a script to more closely mirror the IBM test tool
which results in a block of 0's in the middle of the file.  However, this
script causes unreported data loss, so it should be enough to show that
there is an issue.
Comment 10 Jeff Layton 2008-11-10 08:46:38 EST
Testing the above reproducer now. I've also added a netem qdisc that simulates 10% packet loss. I'm hoping that will help spur the failure earlier. So far it seems to be running w/o any problems and I haven't seen the message pop in the ring buffer.
Comment 11 Jeff Layton 2008-11-10 10:11:35 EST
Still running and no errors yet.

As a side note that cifs over ipv6 isn't implemented in RHEL5, so we don't need to obsess over getting that codepath right. It doesn't hurt to do so, however.
Comment 12 Jeff Layton 2008-11-11 09:56:35 EST
The reproducer finally hit an error around 7pm last night. I'm rerunning it now with the patch. If it hasn't failed by tomorrow morning, I'll declare victory and post the patch for review tomorrow.
Comment 13 Jeff Layton 2008-11-12 07:36:22 EST
The reproducer ran overnight so it looks like it generally fixes the reproducer. That said, it looks like there is a problem with this patch. If we don't send all of the frame in smb_send2, then we're marking the socket CifsNeedReconnect, but what about smb_send?

Granted, that's generally for sending smaller frames and the problem is less likely to occur there (and probably won't result in data corruption), but it seems like we should be doing the same thing there.
Comment 14 Jeff Layton 2008-11-12 08:36:37 EST
Ugh, fixing *that* is non-trivial. The function interfaces here are a mess...

smb_send doesn't take a TCP_Server_Info arg, it takes a socket. So we don't have an easy way to mark the socket as needing to be reconnected. We could change it to take a server arg, but then we'll have to fix the ipv4_connect code to do something different.

To fix this the right way will mean a number of cascading changes. We may just have to take the patch as-is for now, and try to fix it the right way later. I'll wait until Shirish chimes in before I post it though...
Comment 15 Shirish S. Pargaonkar 2008-11-12 09:52:05 EST
(In reply to comment #14)

I think we should take these patches as it is.  They fix the data integrity
problem and in the testing done so far, they have not introduced any
newer bugs/problems.

There are definitely places and ways these patches could and should be
improved but for now, I think we should  incorporate the patches as they are.
Comment 16 Jeff Layton 2008-11-12 10:02:38 EST
Agreed, but mainly because we're running out of time for 5.3. Hopefully you can get these problems fixed upstream so that we can pull those fixes in for 5.4.
Comment 19 Don Zickus 2008-11-18 16:21:15 EST
in kernel-2.6.18-124.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 21 IBM Bug Proxy 2008-12-16 18:25:52 EST
This is failing QA in IBM's tests of snapshot 4 & 5.
Comment 22 Ric Wheeler 2008-12-16 18:30:50 EST
Can we get any details on the IBM testing and the observed failures?
Comment 23 IBM Bug Proxy 2008-12-16 18:33:54 EST
From the LTC bugzilla:

Comment  #220 From Shirish S. Pargaonkar  2008-12-16 14:16:56 EST  [reply] -------

Ran Consoleapp test on RHEL 5.3 snapshot 5 and the DI error
occurred within few hours. 

Linux dende-04 2.6.18-125.el5PAE #1 SMP Mon Dec 1 17:56:48 EST 2008 i686 i686
i386 GNU/Linux

Will need to test on RHEL 5.3 snapshot 6.
Comment 24 Jeff Layton 2008-12-17 06:47:38 EST
Created attachment 327232 [details]
patch -- prevent cifs_writepages from skipping pages

Shirish emailed me late last night and said that the problem he was seeing was because RHEL5.3 was missing this patch.

Working on getting confirmation from him now. I never saw a problem with the provided reproducer without this patch, so there may be something particular about the testcase that IBM is using that helps trigger the problem.
Comment 29 Don Zickus 2008-12-17 16:12:17 EST
in kernel-2.6.18-128.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 30 Jeff Layton 2008-12-18 07:52:49 EST
The reproducer I have survived overnight. That's a good sign, but I'll note that it did that without the latest patch as well. We'll have to wait for results from IBM's proprietary reproducer for final results.
Comment 31 Jeff Layton 2008-12-18 10:16:30 EST
Shirish's reproducer also survived last night on his modified RHEL5 kernel. That kernel had the patch that we applied to -128.el5. He's now testing the actual -128.el5. We should have final results tomorrow, but I expect that it will also survive. The CIFS code he tested last night should be virtually identical to the code in -128.el5.
Comment 32 Jeff Layton 2008-12-19 06:16:37 EST
Email from Shirish this morning:

-------------------[snip]-------------------

The test is passing (as expected) on the .128 kernel.  Thanks for all
the help to get all the fixes in.
It would have been unfortunate if we had missed Shaggy's patch.

-------------------[snip]-------------------

Looks like it's good.
Comment 37 errata-xmlrpc 2009-01-20 14:38:30 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Note You need to log in before you can comment on or make changes to this bug.