Bug 504811 - e1000 silently corrupting data
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 533192 5.5_Known-Issues
Reported: 2009-06-09 11:16 EDT by Issue Tracker
Modified: 2014-06-29 19:01 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Data corruption on NFS filesystems might be encountered on network adapters without support for error-correcting code (ECC) memory that also have TCP segmentation offloading (TSO) enabled in the driver. Note: data that might be corrupted by the sender still passes the checksum performed by the IP stack of the receiving machine. A possible workaround for this issue is to disable TSO on network adapters that do not support ECC memory.
Last Closed: 2010-02-08 14:24:35 EST


Description Issue Tracker 2009-06-09 11:16:51 EDT
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2009-06-09 11:16:53 EDT
Event posted on 08-22-2006 03:27pm EDT by woodard

Date: Tue, 22 Aug 2006 12:23:49 -0700
From: Ben Woodard <woodard@redhat.com>
User-Agent: Thunderbird 1.5.0.5 (X11/20060719)
MIME-Version: 1.0
To:  alan@redhat.com,  tech-list@redhat.com
Subject: Re: network file system corruption with TSO on.
References: <44E6351E.1070501@redhat.com> <20060818222637.GA3060@devserv.devel.redhat.com> <44EA1C95.9040207@redhat.com>
In-Reply-To: <44EA1C95.9040207@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Ben Woodard wrote:
> Alan Cox wrote:
>> On Fri, Aug 18, 2006 at 02:46:06PM -0700, Ben Woodard wrote:
>>> I don't think I'd feel comfortable running NFS over an interface that
>>> had TSO. When a packet is checksummed in memory ECC still protects the
>>> data integrity. However, when the data goes to the card, nothing
>>> protects it between the time it is sent to the card and when the
>>> checksum is computed. If an error is introduced there, then it won't
>>> be detected. Most of our corruption was read corruption but some of
>>> it was
>>
>> Jonathan Stone and Craig Partridge actually tried to do some analysis on
>> this, and the paper "When the CRC and TCP checksum disagree" should be
>> compulsory reading for checksum offload weenies and anyone working with
>> very large data sets over IP.
>>
>> Lustre should maybe be doing its own checksumming given the volumes in
>> question and keeping it on disk.
>>
>> Also the TCP checksum itself isn't brilliant - it's a fairly simple 16-bit
>> checksum algorithm with vulnerabilities to things like reordering and
>> some kind of repeat. You rely on it each router hop so this is a growing
>> concern as data rates and volumes continue to rise.

Given these concerns and the fact that we already have one proven
example where TSO ends up corrupting data when there is a hardware
problem, is it really wise to have TSO turned on by DEFAULT?

The problem is we have ECC and EDAC for main memory and so if the data
is corrupted there, we should at least know about it. The PCI bus also
has some sort of ECC or parity and supposedly is able to inform the
kernel when data is corrupted. (Is anything actually watching for these
messages though?) However, we have seen that once the data leaves the
PCI bus but before the card computes the TCP checksum on the ethernet's
buffers, we can have data corruption. This seems like a big freaking
issue. Are the financial houses OK with the fact that their ethernet
cards can SILENTLY corrupt data? Are they aware that they MUST have a
higher level of data integrity checking on their TCP connections when
using cards such as the e1000 which use TSO by default? Do we publish
this recommendation anywhere? Is leaving TSO on by default the right
thing to do for an enterprise distribution?

-ben


>>
>> See also
>>
>> http://www.storagetek.com/hughes/AAL5_CRC.pdf#search=%22%22Performance%20of%20checksums%20and%20CRCs%20over%20real%20data.%22%22
>>
>>
>>
> Thanks for the references to the articles. Currently, I do not think
> that Lustre does do any checksumming. I think that right now we are
> reconsidering our vulnerabilities and we'll come up with a plan of
> action soon.
>
> One thing that really has been driven home time and time again working
> here at the lab is that as you increase the scale and the speed, improbable
> events really do happen fairly frequently. This is kind of why I have no
> problem believing in evolution and laugh at how naive most creationists'
> arguments are. ;-)
>
> -ben
>


-----

Date: Mon, 21 Aug 2006 12:36:21 -0700
From: Ben Woodard <bwoodard@llnl.gov>
User-Agent: Thunderbird 1.5.0.5 (X11/20060719)
MIME-Version: 1.0
To:  behlendorf1@llnl.gov
CC: Doug East <dre@llnl.gov>, Ryan Braby <braby1@llnl.gov>,
 Chris Dunlap <dunlap6@llnl.gov>,
 Andrew Uselton <uselton2@llnl.gov>,
 Richard Hedges <richard-hedges@llnl.gov>,
 Terry Heidelberg <th@llnl.gov>, Keith Fitzgerald <fitzgerald2@llnl.gov>,
 Marc Stearman <stearman2@llnl.gov>,
 Mark Grondona <mgrondona@llnl.gov>, Michael Miller <mke@llnl.gov>,
 Kim Cupps <kimcupps@llnl.gov>
Subject: Re: Corruption on /p/lscratcha
References: <200608172045.28330.behlendorf1@llnl.gov> <200608181407.10832.behlendorf1@llnl.gov> <p06230903c10c124e6ce4@[134.9.93.143]>
In-Reply-To: <p06230903c10c124e6ce4@[134.9.93.143]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Brian,

I've been trying to wrap my head around the full implications of our
discoveries and quite frankly it scares me. I did a bit of digging and
evidently relying on TCP checksums for packets, while fast, is not
reliable. I think that we uncovered a pretty significant problem when we
discovered that the e1000 doesn't have ECC bits in its buffers. Thus
once the data leaves the PCI bus there is no parity at all and it can
become corrupted. In the normal case where TSO is off, the TCP checksum
is computed before the data crosses into this unprotected region.
However, with TSO on then there is nothing ensuring that errors don't
creep in from the time it is sent down the PCI bus until the time the
checksum is computed by the NIC.
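
The ordering argument above can be sketched as a small simulation. This is only an illustration under stated assumptions: `transmit`, `flip_bit`, and `receiver_accepts` are hypothetical helpers, the single flipped bit stands in for the faulty adaptor buffer, and the checksum is a plain RFC 1071 style sum over the payload without the real TCP pseudo-header.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, offset: int, bit: int) -> bytes:
    """Model a faulty adaptor buffer by flipping one bit (hypothetical fault)."""
    corrupted = bytearray(data)
    corrupted[offset] ^= 1 << bit
    return bytes(corrupted)


def transmit(payload: bytes, tso_enabled: bool):
    """Return (data on the wire, checksum on the wire) for a faulty adaptor."""
    if tso_enabled:
        # Offload case: the data is damaged first, then the adaptor
        # computes a checksum that matches the damaged data.
        on_wire = flip_bit(payload, 3, 0)
        return on_wire, internet_checksum(on_wire)
    # Non-offload case: the host computes the checksum before the data
    # reaches the unprotected buffer, so the damage no longer matches it.
    checksum = internet_checksum(payload)
    return flip_bit(payload, 3, 0), checksum


def receiver_accepts(data: bytes, checksum: int) -> bool:
    return internet_checksum(data) == checksum
```

In this sketch the receiver rejects the damaged segment when the host computed the checksum, and accepts it when the checksum was computed after the damage occurred.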

I was talking to Alan Cox and he pointed out that TCP's checksum isn't
the most effective function. It is a kind of hash function and
theoretically when fed random data there are no hot spots of hash
collisions. However, in the real world and especially with highly
repetitive data sets such as we have here at LLNL in these
checkpoint/restart files, TCP's checksum function breaks down pretty
badly. There are huge numbers of hot spots. If we really need solid data
integrity then we are probably going to have to add some additional
safeguards such as having Lustre do some application level checksums. He
suggested that we read the following papers:
http://portal.acm.org/citation.cfm?doid=347059.347561#search=%22When%20the%20CRC%20and%20TCP%20checksum%20disagree%22
http://www.storagetek.com/hughes/AAL5_CRC.pdf#search=%22%22Performance%20of%20checksums%20and%20CRCs%20over%20real%20data.%22%22
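
One concrete weakness mentioned in this thread is reordering. A minimal sketch, assuming a plain RFC 1071 style sum over the payload (no TCP pseudo-header) and using `zlib.crc32` purely as a comparison point: swapping two 16-bit words leaves the ones' complement sum unchanged, while the CRC changes.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78"
reordered = b"\x56\x78\x12\x34"  # the same two 16-bit words, swapped

# Addition is commutative, so the 16-bit sum cannot see the swap.
same_internet_sum = internet_checksum(original) == internet_checksum(reordered)
# A CRC depends on bit positions, so the swap changes it.
same_crc32 = zlib.crc32(original) == zlib.crc32(reordered)
print(same_internet_sum, same_crc32)  # prints: True False
```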


Doug East wrote:
> Thanks Brian and all who worked this problem.
>
> Doug
>
>
> At 2:07 PM -0700 8/18/06, Brian Behlendorf wrote:
>>   We believe this issue has been resolved after further investigation
>> today. The problem was determined to be a misbehaving e1000 network
>> adaptor.  The adaptor was corrupting data in its TX buffer before
>> generating the TCP checksum for the datagram.  This caused a valid
>> checksum to be created for corrupted data.  The offending GigE adaptor
>> has been replaced and thus far we haven't seen any new corruption.
>>
>>   To verify things are now good I put together a test script which is
>> recursively copying files to /p/lscratcha and verifying the data is
>> correct. It has run for roughly 45 minutes and moved 6000 files
>> without encountering any corruption.
>>
>>   In the clear light of day my initial suspicions implicating the TCP
>> stack or network driver turned out to be wrong.  I was misinterpreting
>> the tcpdump data due to the TSO support being enabled.  Mark Grondona
>> cleared this up for me, thanks!  A notice to the users will be sent out
>> explaining the issue has been resolved.
>>
>> --
>> Thanks,
>> Brian
>>
>>
>>>    This evening Chris Dunlap and I dug in to the unexpected
>>> corruption issue observed on /p/lscratcha and made some progress.
>>> We confirmed that the DDN disk is not the root cause.  This is being
>>> caused by either the TCP stack in linux or the e1000 network driver.
>>>
>>>    With Chris's help we dug through several tcpdump logs using
>>> ethereal and observed that for many datagrams the TCP checksum was
>>> simply wrong even before it left the source.  Perhaps more
>>> disturbingly this is a case TCP should be able to handle without
>>> corruption.  The packet with a bad checksum should simply be dropped
>>> and not ACKed but that is not what is happening.  At the moment the
>>> thinking is we have at least two bugs in the networking code causing
>>> this.
>>>
>>>    More analysis will obviously have to be done tomorrow to entirely
>>> understand the bug and come up with a workaround/fix.  One potential
>>> (untested) fix will be to revert the latest chaos kernel which is
>>> running on bigs to an older version.  Looking at the kernel change
>>> log reveals one small change in the e1000 driver which may be
>>> related.
>>>
>>>    Another interesting tidbit concerning this problem is it is not
>>> confined to the lustre networks.  We easily observed the same issue
>>> on the bigs management network while debugging.  Thus while we
>>> noticed this first via lustre the problem isn't confined to lustre.
>



This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 100410
Comment 2 Issue Tracker 2009-06-09 11:16:56 EDT
Event posted on 08-22-2006 03:30pm EDT by woodard

Those links are:
http://portal.acm.org/citation.cfm?doid=347059.347561#search=%22When%20the%20CRC%20and%20TCP%20checksum%20disagree%22
http://www.storagetek.com/hughes/AAL5_CRC.pdf#search=%22%22Performance%20of%20checksums%20and%20CRCs%20over%20real%20data.%22%22

Also Mark figured out that TSO has been enabled by default on the e1000
since at least 2003.



woodard assigned to issue for LLNL (HPC).
Status set to: Waiting on Tech

Comment 3 Issue Tracker 2009-06-09 11:16:59 EDT
Event posted on 08-22-2006 05:12pm EDT by woodard

Brian tells me that according to the E1000 data sheets for the card we had
problems with, there is no mention of ECC in its buffers. However, for the 10G
enet they do explicitly state that they have internal ECC. Sounds like
someone has learned their lesson.

The code in the driver which selects whether TSO should be used is as
follows:
#ifdef NETIF_F_TSO
        if ((adapter->hw.mac_type >= e1000_82544) &&
           (adapter->hw.mac_type != e1000_82547))
                netdev->features |= NETIF_F_TSO;

#ifdef NETIF_F_TSO_IPV6
        if (adapter->hw.mac_type > e1000_82547_rev_2)
                netdev->features |= NETIF_F_TSO_IPV6;
#endif
#endif

This sort of implies that devices < e1000_82544 didn't support TSO and
that there is something wrong with the TSO in e1000_82547. What is wrong
with 82547? (rhetorical question) At what point do they set the bar and
decide that TSO is not safe to use on a particular adapter for fear of
it corrupting data? It would seem that any card that doesn't have ECC for
its buffers might qualify in the context of an enterprise distribution.

woodard assigned to issue for LLNL (HPC).
Internal Status set to 'Waiting on Support'

Comment 4 Issue Tracker 2009-06-09 11:17:01 EDT
Event posted on 08-22-2006 07:07pm EDT by woodard

From: Brian Behlendorf <behlendorf1@llnl.gov>
Reply-To: behlendorf1@llnl.gov
Organization: LLNL
To: Ben Woodard <bwoodard@llnl.gov>
Subject: TSO in the e1000 driver
Date: Tue, 22 Aug 2006 15:54:19 -0700
User-Agent: KMail/1.7.1
Cc: Ryan Braby <braby1@llnl.gov>, Keith Fitzgerald <kfitz@llnl.gov>,
        Mark Grondona <mgrondona@llnl.gov>, Kim Cupps <kimcupps@llnl.gov>,
        Jim Garlick <garlick1@llnl.gov>, shereda@llnl.gov,
        Bryan Lawver <lawver1@llnl.gov>, Terry Heidelberg <th@llnl.gov>
References: <200608172045.28330.behlendorf1@llnl.gov>
 <p06230903c10c124e6ce4@[134.9.93.143]> <44EA0B35.80905@llnl.gov>
In-Reply-To: <44EA0B35.80905@llnl.gov>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200608221554.19922.behlendorf1@llnl.gov>


Ben,

  I think you laid out a nice synopsis of this potentially serious but
likely very very rare problem.  Let me clarify a few points as to why I
think we probably don't want to rush to disable TSO on all our systems
as a fix for this issue.

  First off, TSO support is not off by default in the e1000 driver.  It
is _on_ by default for newer models of the e1000 adaptor including ours.
It has been enabled in our chaos kernel by default back to at least the
2.6.9-1chaos kernel and likely earlier than that.  The tg3 driver on the
other hand has TSO support disabled by default.

  Secondly, I do think disabling TSO probably would have prevented this
corruption but that has not been conclusively tested.  Bryan Lawver
wasn't able to successfully verify this since installing the card in a
test node resulted in a hung test node.  I also think disabling TSO
support will have additional undesirable consequences which may not have
been mentioned.  I fear disabling TSO support will put us in a situation
where we will not detect similar adaptor problems in the future.  If
things work as I'd expect we would only see an abnormally high number of
TCP retransmissions on the node with the bad adaptor and potentially
reduced performance.  Neither of these things are currently monitored
closely enough today to detect the issue.
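
A minimal sketch of the kind of retransmission monitoring mentioned above, assuming a Linux host that exposes TCP counters in `/proc/net/snmp` as a pair of `Tcp:` lines (field names, then values). The helper names and any alerting threshold are hypothetical and would need tuning per site.

```python
def tcp_counters(snmp_text: str) -> dict:
    """Parse the two 'Tcp:' lines of /proc/net/snmp into a name -> value map."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(snmp_text: str) -> float:
    """Fraction of sent segments that were retransmissions since boot."""
    counters = tcp_counters(snmp_text)
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0


def read_proc_snmp(path: str = "/proc/net/snmp") -> str:
    """Read the live counters; only meaningful on a Linux host."""
    with open(path) as handle:
        return handle.read()
```

Sampling `retransmit_ratio(read_proc_snmp())` periodically on each node and comparing nodes against one another is one way a single misbehaving adaptor might stand out.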

  Thirdly, disabling TSO support will impact lustre performance.  We know
at a minimum it will increase the CPU usage on all TCP attached nodes and
force us to disable zero-copy support which depends on TSO.  The exact
extent of the performance impact we have not measured, but I suspect it
will not be trivial.

  The right fix for all of these issues I think will be Lustre level
end-to-end checksumming which is currently implemented in a dormant form
in the current code.  It will need some work to become production ready
but it is the best solution to address the concerns I mentioned above.
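
The end-to-end idea can be sketched as follows. This is not Lustre's implementation: `wrap` and `unwrap` are hypothetical helpers, and `zlib.crc32` is only a stand-in for whatever checksum an application would choose. The point is that the checksum is computed in ECC-protected host memory before the data reaches the adaptor, and verified by the application on the far side.

```python
import zlib


def wrap(payload: bytes) -> bytes:
    """Append a CRC32 computed by the application before transmission."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unwrap(message: bytes) -> bytes:
    """Verify the trailing CRC32 and return the payload, or raise ValueError."""
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("end-to-end checksum mismatch")
    return payload
```

Because the check is independent of where the transport checksum is generated, damage introduced inside the adaptor is caught even when the segment's TCP checksum is valid for the damaged bytes.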

Thanks,
Brian


> Brian,
>
> I've been trying to wrap my head around the full implications of our
> discoveries and quite frankly it scares me. I did a bit of digging and
> evidently relying on TCP checksums for packets, while fast, is not
> reliable. I think that we uncovered a pretty significant problem when
> we discovered that the e1000 doesn't have ECC bits in its buffers. Thus
> once the data leaves the PCI bus there is no parity at all and it can
> become corrupted. In the normal case where TSO is off, the TCP checksum
> is computed before the data crosses into this unprotected region.
> However, with TSO on then there is nothing ensuring that errors don't
> creep in from the time it is sent down the PCI bus until the time the
> checksum is computed by the NIC.
>
> I was talking to Alan Cox and he pointed out that TCP's checksum isn't
> the most effective function. It is a kind of hash function and
> theoretically when fed random data there are no hot spots of hash
> collisions. However, in the real world and especially with highly
> repetitive data sets such as we have here at LLNL in these
> checkpoint/restart files, TCP's checksum function breaks down pretty
> badly. There are huge numbers of hot spots. If we really need solid
> data integrity then we are probably going to have to add some
> additional safeguards such as having Lustre do some application level
> checksums. He suggested that we read the following papers:
> http://portal.acm.org/citation.cfm?doid=347059.347561#search=%22When%20the%20CRC%20and%20TCP%20checksum%20disagree%22
> http://www.storagetek.com/hughes/AAL5_CRC.pdf#search=%22%22Performance%20of%20checksums%20and%20CRCs%20over%20real%20data.%22%22


Internal Status set to 'Waiting on Support'

Comment 5 Issue Tracker 2009-06-09 11:17:03 EDT
Event posted on 08-22-2006 07:27pm EDT by woodard

From: Alan Cox <alan@redhat.com>
To: Ben Woodard <woodard@redhat.com>
Cc: alan@redhat.com, tech-list@redhat.com
Subject: Re: network file system corruption with TSO on.
Message-ID: <20060822230917.GA1754@devserv.devel.redhat.com>
References: <44E6351E.1070501@redhat.com>
<20060818222637.GA3060@devserv.devel.redhat.com>
<44EA1C95.9040207@redhat.com> <44EB59C5.10001@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <44EB59C5.10001@redhat.com>
User-Agent: Mutt/1.4.1i

On Tue, Aug 22, 2006 at 12:23:49PM -0700, Ben Woodard wrote:
> example where TSO ends up corrupting data when there is a hardware
> problem, is it really wise to have TSO turned on by DEFAULT?

For most applications TSO is "good enough"

> The problem is we have ECC and EDAC for main memory and so if the data
> is corrupted there, we should at least know about it. The PCI bus also
> has some sort of ECC or parity and supposedly is able to inform the
> kernel when data is corrupted. (Is anything actually watching for these
> messages though?) However, we have seen that once the data leaves the

EDAC now

> cards can SILENTLY corrupt data? Are they aware that they MUST have a
> higher level of data integrity checking on their TCP connections when
> using cards such as the e1000 which use TSO by default? Do we publish
> this recommendation anywhere? Is leaving TSO on by default the right
> thing to do for an enterprise distribution?

Good question. I don't know what we should be doing here for those cases.
Maybe we actually need some kind of "paranoid mode" tutorial on these
issues - I don't think most users understand them.

Alan



Comment 6 Issue Tracker 2009-06-09 11:17:08 EDT
Event posted on 08-22-2006 08:00pm EDT by woodard

The actual card that failed is an 82546GB. I've got it here on my desk. We
are not sure that our cluster is completely homogeneous. We seem to have an
82546EB as well and the on-board cards are 82541GI/PI. Those ones should
not be using TSO anyway. I'm not sure if we even use them.


Comment 7 Issue Tracker 2009-06-09 11:17:11 EDT
Event posted on 08-23-2006 03:42pm EDT by kbaxley

The e1000 cards that LLNL uses are:

82540EM Gigabit Ethernet Controller (rev 02)
82541GI/PI Gigabit Ethernet Controller
82546EB Gigabit Ethernet Controller (Copper) (rev 01)
82546GB Gigabit Ethernet Controller (rev 03)

The majority of them are the 82546GB, which is the one that they've had
the problems with.



Comment 8 Issue Tracker 2009-06-09 11:17:14 EDT
Event posted on 08-23-2006 03:47pm EDT by woodard

Date: Wed, 23 Aug 2006 10:05:53 -0700
From: Kim Cupps <kimcupps@llnl.gov>
Subject: Re: TSO in the e1000 driver
In-reply-to: <200608221554.19922.behlendorf1@llnl.gov>
To: behlendorf1@llnl.gov, Ben Woodard <bwoodard@llnl.gov>
Cc: Ryan Braby <braby1@llnl.gov>, Keith Fitzgerald <kfitz@llnl.gov>,
        Mark Grondona <mgrondona@llnl.gov>, Kim Cupps <kimcupps@llnl.gov>,
        Jim Garlick <garlick1@llnl.gov>, shereda@llnl.gov,
        Bryan Lawver <lawver1@llnl.gov>, Terry Heidelberg <th@llnl.gov>
Message-id: <7.0.0.16.2.20060823100435.056e1898@llnl.gov>
MIME-version: 1.0
X-Mailer: QUALCOMM Windows Eudora Version 7.0.0.16
Content-type: text/plain; charset=us-ascii; format=flowed
Content-transfer-encoding: 7BIT
References: <200608172045.28330.behlendorf1@llnl.gov>
 <p06230903c10c124e6ce4@[134.9.93.143]> <44EA0B35.80905@llnl.gov>
 <200608221554.19922.behlendorf1@llnl.gov>

Brian,

I think until we get Lustre checksumming working and production ready
we ought to disable TSO. We can't say that we can afford the
possibility of corrupting data because of improved performance. I
agree that TSO can be enabled once we have the checksumming working.

Regards,
Kim



Comment 9 Issue Tracker 2009-06-09 11:17:16 EDT
Event posted on 02-02-2009 07:56pm EST by woodard

Look, it is not safe to enable TSO on that particular card. It can silently
corrupt data. We know this, we worked around the problem by disabling TSO.
We suggested that RH modify the driver to disable TSO on this particular
card unless it is explicitly turned on.

Product changed from 'Red Hat Certificate System' to 'Issue-Tracker'
Version changed from '7.1' to 'UNKNOWN'

Comment 10 Andy Gospodarek 2009-06-09 14:13:20 EDT
Right now, I'm not sure what to make of this bug.

At first it appears that there is a request to disable TSO by default on the 82546 (and maybe all e1000) cards.

Then it seems that LLNL has determined that the best solution may be to run some checksumming on their software to prevent errors.

I'm not sure I want to disable TSO on any card that does not support ECC memory since:

1.  That's most cards.

2.  Plenty of 'regular' users don't have any problems with TSO.

3.  It's quite easy to disable (though not as easy as it could be).

Where do you want us to go from here?
Comment 11 Ben Woodard 2009-06-09 14:37:18 EDT
If we were selling a desktop distribution, then I think that I would agree with you. However, since this is an ENTERPRISE distribution and because a VERY hard to diagnose hardware failure can lead to data corruption, I really do believe that we need to disable TSO on the cards that do not have ECC. Note that some later versions of the e1000 do. And we need to release note the reason we made the change. Then we can point users to the release note when they notice the perf drop and if they turn on TSO the problem is on their own head.

The problem is that the bytes in the TCP stream can get corrupted and there is literally no way to detect it because the checksums on the TCP packets are correct. So say you have a write to NFS and are using jumbo frames, as we were, and then in the files written by a particular machine you find that every 6200 to 12400 bytes (or something like that) one byte is wrong. What we found was one bit was permanently held low. So when the byte in the data file happened to have that one bit low there was no problem with that byte, so the interval between corrupted bytes was some multiple of an integer which was related to the MTU. Do we really want to be liable for that kind of data corruption? With EXTREMELY careful study and very advanced staff we were able to detect the problem and track down its root cause. However, there is no telling how many times this kind of problem is silently happening. It is the kind of thing that unless you are extremely sophisticated you will never be able to unravel. So you can't legitimately say that no one else is having the problem. We might just be the first who can detect and unravel the root cause of the problem for you.
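
The failure signature described above can be reproduced with a small model. Everything here is an assumption for illustration: `period` (6200, taken from the rough figure quoted above), `cell`, `bit`, and the ramp-shaped test data are hypothetical, and the model simply clears one bit at a fixed position in every buffer-sized stretch of the stream.

```python
def stuck_low(data: bytes, period: int, cell: int, bit: int) -> bytes:
    """Clear `bit` in every byte whose offset is `cell` modulo `period`."""
    out = bytearray(data)
    for offset in range(cell, len(out), period):
        out[offset] &= ~(1 << bit) & 0xFF
    return bytes(out)


def corrupted_offsets(original: bytes, received: bytes) -> list:
    """Offsets at which the received stream differs from the original."""
    return [i for i, (a, b) in enumerate(zip(original, received)) if a != b]


# Deterministic stand-in for file contents: a repeating 0..255 ramp.
data = bytes(i % 256 for i in range(100_000))
received = stuck_low(data, period=6200, cell=1234, bit=3)
bad = corrupted_offsets(data, received)

# Only bytes that originally had the bit set are damaged, so the gaps
# between damaged bytes are multiples of the period, not the period itself.
gaps = sorted({later - earlier for earlier, later in zip(bad, bad[1:])})
print(bad[:3], gaps)
```

With this particular input every other candidate byte already has the bit clear, so the damaged bytes end up spaced two periods apart, which matches the "some multiple of an integer" pattern described in the comment.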
Comment 12 Andy Gospodarek 2009-06-16 16:09:49 EDT
I've been thinking about this and I go back and forth.  Not because this is an 'Enterprise distro' (which isn't even clearly defined), but because even desktop users would care if data was getting corrupted.  They are no more likely to enjoy corrupted data than anyone else.

When I think about this issue, questions start to come up as to what type of checksumming we will allow (I'm asking seriously here).  If we are going to limit TCP checksumming to only cards that have ECC, should we not also limit rx/tx checksumming and LRO as well?  Is the layer 3-7 portion of the data any more likely to be corrupted than the packet headers?

Imagine a bit got flipped in the destination MAC.  Sure the destination would not pick up the frame, but is the aggregate savings of not having to spend the time writing the checksum on the way out worth the retransmission time if you are using a streaming application that requires low latency?  I would expect that some of our customers might not agree.  The use case for everyone is different, which is why we have the option to use ethtool to disable any form of hardware checksumming and offload.

Since it's possible to disable TSO pretty easily with ethtool one might be inclined to just close this as WONTFIX with a workaround, but I'm interested enough in the problem that I would like to at least research (starting with the e1000 and e1000e-based hardware) and see how many of those cards do not have ECC.  If we ultimately decide not to make this change we will at least have a list that can be used for a kbase/release note entry.  I'm going to add some folks from Intel to the cc-list to see if they can help us out with that list.
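
For reference, the ethtool workaround can be scripted. A minimal sketch, assuming the `ethtool` utility is installed and the caller has root privileges; the helper names and the interface name `eth0` in the usage note are hypothetical. It builds the `ethtool -K <interface> tso off` command and the `ethtool -k <interface>` query that lists the current offload settings.

```python
import subprocess


def tso_command(interface: str, enable: bool) -> list:
    """Build the ethtool invocation that turns TSO on or off."""
    return ["ethtool", "-K", interface, "tso", "on" if enable else "off"]


def show_offload_command(interface: str) -> list:
    """Build the ethtool invocation that lists current offload settings."""
    return ["ethtool", "-k", interface]


def set_tso(interface: str, enable: bool) -> None:
    """Apply the setting; raises CalledProcessError if ethtool fails."""
    subprocess.run(tso_command(interface, enable), check=True)
```

Calling `set_tso("eth0", False)` would disable TSO on that interface until the next reboot or driver reload; the setting does not persist by itself, so it has to be reapplied from an init or network script.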

(Just for the record, the 82546 has a pretty nasty hardware bug and probably isn't the best choice for a card, but I understand there are probably a lot of them already installed so replacing them might not be reasonable.  I've talked with Intel about it and there is no workaround -- if you aren't hitting a lot of Tx Timeouts be happy, but know they can happen.)
Comment 13 Andy Gospodarek 2009-06-16 16:15:08 EDT
Jesse, do you mind helping us compile a list of Intel networking hardware that has ECC memory?

I know the 82571 supports it (based on e1000e driver code) and I think the 82599 does as well, but I'm unsure which of the older pci/pci-x hardware has support for it, and with no mention of it in the igb source, I'm not sure if all or none of it has ECC.

Thanks!
Comment 14 Jesse Brandeburg 2009-06-16 17:56:39 EDT
no ecc:
=======
all 10/100 (e100 driver)
all PCI/PCI-X gigabit (e1000)
all ixgb
82573(V/L/E)
82574
82572

ecc support:
============
e1000e:
82571
ESB2 (maybe on by default?)

igb:
82575 (default on)
82576 (default on)

ixgbe:
82598 (default on, there is errata that requires ecc off on internal descriptor completion memory, but that could not cause data corruption)
82599 (default on)

I've got internal queries out to see if any of this info needs correction, but it is what I could find on my own.
Comment 15 John Ronciak 2009-06-16 18:23:57 EDT
Yes it does but the driver had to enable it for the 82571 controller (what I think you are seeing in the code).  None of the PCI/PCI-X parts have it.  The 82598 and 82599 (10gig) parts also have it.  All PCIe parts since the 82571 also have it and it is enabled by default on those parts.

The igb parts have it but it's just enabled by default so the driver doesn't need to do anything about it.  So the 82575 and 82576 both have it.
Comment 16 Andy Gospodarek 2009-07-09 22:05:35 EDT
Thanks, Jesse and John, for the details.  How would you feel about disabling TSO *by default* on all network cards whose memory does not support ECC?

For those scoring at home, this would basically mean that upstream:

TSO disabled by default:
e1000
ixgb
e1000e (82572, 82573, 82574)


TSO enabled by default:
e1000e (except 82572, 82573, 82574)
igb
ixgbe
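
The proposed default can be written down as a small lookup. This is only a restatement of the two lists above as data, not driver code: the function name is hypothetical, and the part numbers are exactly those given in this comment.

```python
# Drivers whose hardware has no ECC on its buffers, per the list above.
NO_ECC_DRIVERS = {"e1000", "ixgb"}
# e1000e parts called out above as lacking ECC.
NO_ECC_E1000E_PARTS = {"82572", "82573", "82574"}
# Drivers whose hardware has ECC enabled, per the list above.
ECC_DRIVERS = {"igb", "ixgbe"}


def tso_enabled_by_default(driver: str, part: str = "") -> bool:
    """Return the proposed TSO default for a driver (and part, for e1000e)."""
    if driver in NO_ECC_DRIVERS:
        return False
    if driver == "e1000e":
        if not part:
            raise ValueError("e1000e needs a part number to decide")
        return part not in NO_ECC_E1000E_PARTS
    if driver in ECC_DRIVERS:
        return True
    raise KeyError("driver not covered by the proposal: " + driver)
```

For example, `tso_enabled_by_default("e1000e", "82571")` is true while `tso_enabled_by_default("e1000e", "82574")` is false; any driver outside the lists raises an error rather than guessing.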

I realize that the chances of these errors are rare, but since so few people deviate from the default settings wouldn't it be wise to make sure the default setting is the most reliable?

I feel like we do this already with things like ring buffer value (we don't make it the maximum since it will consume significant memory) and MTU (we don't make it the maximum since it may prevent connectivity with systems that default to an MTU of 1500 or switches that don't support jumbo frames), and to some degree the default coalescing settings (default to a low-latency setting but move towards high-throughput if using AIM).

I would of course be willing to provide patches against upstream (though I know you are perfectly capable), as I'm not interested in this being something that I just wait on you to do.

(By the way, I'm also pursuing this with other vendors.  Intel is not the only one *lucky* enough to have this work done.)
Comment 17 Ben Woodard 2009-07-10 15:30:14 EDT
+1 on this suggestion from me.
Comment 18 John Ronciak 2009-07-11 03:43:53 EDT
Andy,

We are OK with this as long as it is done for all non-ECC memory network devices.  So we could do this for our parts but who does this for all the others?  I don't think that should be us.
Comment 19 Andy Gospodarek 2009-07-13 10:24:56 EDT
John, I don't expect you to push any fixes for any other vendor's adapters.  I'm currently talking to some of the other major vendors to get lists from them as well.

I'm willing to do the work for all of the adapters if you guys don't mind me posting the patches.  I will obviously have your team look at them and ensure that they are acceptable before posting to netdev.
Comment 20 John Ronciak 2009-07-13 13:16:51 EDT
Andy, that will be fine.  Please go ahead and make the changes to our drivers and we'll review them.  Having one person do this for all drivers will make the disabling much more consistent and accurate.
Comment 21 Jeremy West 2010-02-08 14:24:35 EST
After a good deal of discussion on this bug, Red Hat is going to close this as WONTFIX.  We will not be disabling TSO by default, due to the performance problems that option would incur on certain cards.   If you have concerns, please email me directly and we can discuss this further.  

Jeremy West
NA Support Engineering Supervisor
Comment 22 Issue Tracker 2010-02-08 19:00:03 EST
Event posted on 2010-02-08 16:00 PST by woodard

At the VERY least it needs a release note so that people with the affected
cards know the jeopardy that they are in. In cases where one of these
cards is used on something like an NFS it can lead to silent data
corruption.


This event sent from IssueTracker by woodard 
 issue 100410
Comment 23 Kent Baxley 2010-02-09 15:57:10 EST
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
There have been reported cases of data corruption on NFS filesystems that were traced to certain network adapters with no ECC memory support that had TSO enabled by default in the device driver.  In these cases, for example, the receiving system's IP stack assumed that the data being sent was OK because it passesd the checksum, even though the data had been corrupted by the sender.  Contact your hardware vendor to determine whether or not your network adapter supports ECC memory and, if not, consider disabling TSO support on the adapter.
Comment 26 Ryan Lerch 2010-03-22 23:32:59 EDT
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-There have been reported cases of data corruption on NFS filesystems that were traced to certain network adapters with no ECC memory support that had TSO enabled by default in the device driver.  In these cases, for example, the receiving system's IP stack assumed that the data being sent was OK because it passesd the checksum, even though the data had been corrupted by the sender.  Contact your hardware vendor to determine whether or not your network adapter supports ECC memory and, if not, consider disabling TSO support on the adapter.
+Data corruption on NFS filesystems might be encountered on network adapters without support for error-correcting code (ECC) memory that also have TCP segmentation offloading (TSO) enabled in the driver. Note: data that might be corrupted by the sender still passes the checksum performed by the IP stack of the recieving machine  A possible work around to this issue is to disable TSO on network adapters that support ECC memory.
Comment 27 Andy Gospodarek 2010-03-23 09:18:13 EDT
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Data corruption on NFS filesystems might be encountered on network adapters without support for error-correcting code (ECC) memory that also have TCP segmentation offloading (TSO) enabled in the driver. Note: data that might be corrupted by the sender still passes the checksum performed by the IP stack of the recieving machine  A possible work around to this issue is to disable TSO on network adapters that support ECC memory.
+Data corruption on NFS filesystems might be encountered on network adapters without support for error-correcting code (ECC) memory that also have TCP segmentation offloading (TSO) enabled in the driver. Note: data that might be corrupted by the sender still passes the checksum performed by the IP stack of the receiving machine  A possible work around to this issue is to disable TSO on network adapters that do not support ECC memory.
