Bug 503288 - with atl1e driver: Corrupted MAC on input
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 11
Hardware: x86_64 Linux
Priority: low
Severity: medium
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-05-30 16:17 EDT by Gene Czarcinski
Modified: 2010-06-28 08:52 EDT

Last Closed: 2010-06-28 08:46:08 EDT

Attachments:
lspci -v output showing both new and old NICs (9.42 KB, text/plain)
2009-05-30 16:18 EDT, Gene Czarcinski
difference between known good file and atl1e-received file (eeePC 1000HE) (2.38 KB, text/plain)
2009-06-09 18:31 EDT, Bill McGonigle
difference between known-good file and atl1e-transferred file (P5Q Pro) (5.21 KB, text/plain)
2009-06-09 18:33 EDT, Bill McGonigle


External Trackers:
Linux Kernel bug 12282
Linux Kernel bug 13404

Description Gene Czarcinski 2009-05-30 16:17:41 EDT
There appears to be a serious problem with the "atl1e" driver supporting the
Attansic Technology Corp. Atheros AR8121/AR8113/AR8114 PCI-E Ethernet
Controller.

For me, the problem occurred when I was copying hundreds of gigabytes of ISO
image files from one system to another using ssh's "scp" command/program.

At random points during copying I would get the error "Corrupted MAC on input"
which then terminated the scp command.  This "test" was run multiple (about 6)
times and each time it failed at some (random) point.
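
For context, "Corrupted MAC on input" is the message OpenSSH prints when the message authentication code of a received packet fails to verify; MAC here does not refer to the Ethernet hardware address. The sketch below only illustrates the principle that a single flipped bit in transit makes the check fail. The key, the payload, and the choice of HMAC-SHA256 are assumptions made for illustration; the MAC algorithm and packet layout of a real SSH session are negotiated and are not reproduced here.

```python
import hashlib
import hmac


def mac_ok(key: bytes, packet: bytes, tag: bytes) -> bool:
    """Return True if tag is the HMAC-SHA256 of packet under key."""
    expected = hmac.new(key, packet, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating corruption in transit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical values, not taken from any real session.
key = b"hypothetical-session-key"
packet = b"payload bytes of one packet"
tag = hmac.new(key, packet, hashlib.sha256).digest()

intact_ok = mac_ok(key, packet, tag)                 # sender and receiver agree
damaged_ok = mac_ok(key, flip_bit(packet, 5), tag)   # one bit changed on the way
```

A receiver in this situation cannot tell which bit changed, only that the packet is not what the sender authenticated, which is consistent with scp aborting rather than retrying.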

The software: Fedora 11 preview with "latest" updates and the
2.6.29.4-167.fc11.x86_64 kernel.

The hardware: ASUS M4A78 PRO motherboard, AMD Phenom II 940 processor (3 GHz,
four CPUs), 8 GB system memory.  The Atheros Ethernet Controller is integrated
on the mobo.  (I will be attaching the output of "lspci".)

Why do I believe it is the driver --

1.  I installed Fedora 10 running the 2.6.27.24-170.2.68.fc10.x86_64 kernel.  I
again ran a half dozen tests with NO failures.

2.  I installed Fedora 11 preview with updates on another system (4400 dual
processor) with a Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet
(rev 10) NIC. I then ran 4 test copies with NO failures.

3.  Finally, I installed a new PCI Express NIC on the Phenom system --  D-Link
System Inc DGE-560T PCI Express Gigabit Ethernet Adapter (rev 13).  I then ran
8 copy tests with NO failures.

Conclusion: major problem in the atl1e driver

Although I did not test this and thus have no proof, I suspect copying large
amounts of data with something like ftp to this system via atl1e would also
result in corrupted data but the only way to detect it would be by checksumming
the files.
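
A minimal sketch of that kind of check, assuming the original file and the transferred copy can both be read from the machine doing the comparison (or that the digests are computed separately on each side and then compared). The function names are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so multi-gigabyte ISOs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path: str, copied_path: str) -> bool:
    """Return True when both files have the same SHA-256 digest."""
    return sha256_of_file(source_path) == sha256_of_file(copied_path)
```

A mismatch only says that the two files differ; it does not say where the damage happened (network, disk, or memory).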

This has been reported to the kernel bug tracker as:
http://bugzilla.kernel.org/show_bug.cgi?id=13404
Comment 1 Gene Czarcinski 2009-05-30 16:18:46 EDT
Created attachment 345983 [details]
lspci -v output showing both new and old NICs
Comment 2 Bill McGonigle 2009-06-01 22:27:18 EDT
I'm seeing this as well on an ASUS P5Q Pro mobo (same controller), even with very small transfers (rsync over ssh just gets going...).  I saw it earlier on an Asus eeePC 1000HE, with the same NIC, but after much more data transfer.

Both on F11 RC2, all updates.

Gene, have you tried disabling features with ethtool yet?
Comment 3 Bill McGonigle 2009-06-01 22:59:46 EDT
I saw this:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/60764

so I tried turning off all the offload the driver allows, but no improvement.

Plain 'tx off' and 'rx off' are apparently not supported by this driver.
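
For anyone repeating this, a rough sketch of how such an `ethtool -K` invocation could be assembled. The interface name `eth0` and the feature list (`sg`, `tso`, `gso`) are assumptions, not the exact settings tried in this comment; which features can be toggled depends on the driver, and ethtool reports an error for ones it does not support. The helper names are hypothetical, and the command is only executed when `dry_run` is set to False (which requires root).

```python
import subprocess
from typing import List, Sequence


def offload_off_command(interface: str, features: Sequence[str]) -> List[str]:
    """Build an `ethtool -K` argument list that switches each listed feature off."""
    command = ["ethtool", "-K", interface]
    for feature in features:
        command += [feature, "off"]
    return command


def disable_offloads(interface: str, features: Sequence[str], dry_run: bool = True) -> List[str]:
    """Return the command; run it only when dry_run is False."""
    command = offload_off_command(interface, features)
    if not dry_run:
        # Raises CalledProcessError if ethtool exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


# Assumed interface and features, for illustration only.
command = disable_offloads("eth0", ["sg", "tso", "gso"])
```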
Comment 4 Bill McGonigle 2009-06-02 01:15:24 EDT
Gene and I apparently have the same box of parts.  I put in a PCI netgear-branded:

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

and the situation is much improved in terms of frequency, but I still get corrupted packets.  I know this adapter works fine on Fedora 10.

using vbindiff on a file transferred unreliably with netcat (on purpose), I see a very consistent pattern:

087D 81A0: FC 77 D0 2E A8 1F 3C 63  F8 5E 4F 5D 50 AB 26 00  .w....<c .^O]P.&.
087D 81B0: 69 B5 C6 E4 8F 52 45 83  7C DE C7 32 67 56 E1 1A  i....RE. |..2gV..
087D 81C0: 25 81 97 D4 60 33 24 2B  E9 CF EB 4D 91 77 09 59  %...`3$+ ...M.w.Y

087D 81A0: FC 77 D0 2E A8 1F 3C 63  F8 5E 4F 5D 50 AB 26 00  .w....<c .^O]P.&.
087D 81B0: 0D 8F 46 D8 01 1C 67 AF  20 8F 60 14 3D 0D 96 86  ..F...g.  .`.=...
087D 81C0: 25 81 97 D4 60 33 24 2B  E9 CF EB 4D 93 77 09 59  %...`3$+ ...M.w.Y

that is, the pattern of bad bytes, 1's here, is always the same:

087D 81A0: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
087D 81B0: 11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11
087D 81C0: 00 00 00 00 00 00 00 00  00 00 00 00 11 00 00 00

this repeats throughout the file, for as long as I was willing to keep hitting enter.
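
The mask above can be produced mechanically. The following is a sketch, not the tool used here (that was vbindiff): it walks two byte strings in 16-byte rows and prints "11" for each byte that differs and "00" for each byte that matches. The sample data is the three rows quoted above; the names `first` and `second` are placeholders, since it is not stated which dump is the known-good one.

```python
from typing import List, Tuple


def diff_mask_rows(first: bytes, second: bytes, base_offset: int = 0,
                   width: int = 16) -> List[Tuple[int, str]]:
    """Return (offset, mask) for every row of `width` bytes that differs."""
    rows = []
    length = min(len(first), len(second))
    for start in range(0, length, width):
        a = first[start:start + width]
        b = second[start:start + width]
        if a != b:
            mask = " ".join("11" if x != y else "00" for x, y in zip(a, b))
            rows.append((base_offset + start, mask))
    return rows


# The three rows shown above, starting at offset 0x087D81A0.
first = bytes.fromhex(
    "FC 77 D0 2E A8 1F 3C 63 F8 5E 4F 5D 50 AB 26 00"
    "69 B5 C6 E4 8F 52 45 83 7C DE C7 32 67 56 E1 1A"
    "25 81 97 D4 60 33 24 2B E9 CF EB 4D 91 77 09 59"
)
second = bytes.fromhex(
    "FC 77 D0 2E A8 1F 3C 63 F8 5E 4F 5D 50 AB 26 00"
    "0D 8F 46 D8 01 1C 67 AF 20 8F 60 14 3D 0D 96 86"
    "25 81 97 D4 60 33 24 2B E9 CF EB 4D 93 77 09 59"
)

rows = diff_mask_rows(first, second, base_offset=0x087D81A0)
for offset, mask in rows:
    print(f"{offset >> 16:04X} {offset & 0xFFFF:04X}: {mask}")
```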

These are the offsets of the first lines of data in error:

012F 74B0
020D 3E30
0369 9BB0
038F 3E30
03AF 7C30
044D 6CB0
0479 1030
0487 6DB0

The spacing between them seems pretty random.
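
To put numbers on that spacing, a small sketch that parses the offsets listed above and computes the distance between consecutive ones. The variable names are hypothetical; the offsets are copied from this comment.

```python
listed_offsets = """
012F 74B0
020D 3E30
0369 9BB0
038F 3E30
03AF 7C30
044D 6CB0
0479 1030
0487 6DB0
"""


def parse_offset(text: str) -> int:
    """Turn an offset written as two hex groups ('012F 74B0') into an integer."""
    return int(text.replace(" ", ""), 16)


offsets = [parse_offset(line) for line in listed_offsets.splitlines() if line.strip()]

# Distance in bytes from each corrupted row to the next one.
gaps = [later - earlier for earlier, later in zip(offsets, offsets[1:])]

for earlier, gap in zip(offsets, gaps):
    print(f"after 0x{earlier:08X}: next error {gap} bytes later")
```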

In this case the sending machine was with the netgear NIC, on 64-bit kernel, and the receiver is on 32-bit rawhide, using a:

02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 22)

I haven't seen issues between the machine with the Marvell controller and anything else (of course, that's not the only difference among the various machines).  The eeePC is also on 32-bit and still has occasional issues.
Comment 5 Gene Czarcinski 2009-06-02 11:47:51 EDT
I have another system with an AMD 4400+ dual processor on an ABIT motherboard.  The onboard NIC failed (but the rest of the mobo worked) so I installed a Netgear NIC (same as yours):
05:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

For testing, I installed Fedora 11 preview along with available updates.  I then ran my scp test (eight times) ... I saw no errors.

I have seen no errors with the D-Link NIC on the Phenom system since installing it.
Comment 6 Gregory P. Smith 2009-06-07 15:22:53 EDT
fyi - I have experienced received data corruption problems using the r8169 driver on ubuntu 9.04's 2.6.28 based kernel.

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/384584
Comment 7 Bill McGonigle 2009-06-08 19:03:59 EDT
Please ignore my comment #4.  This appears to be the result of a bad flash drive used to sneakernet the file.  The netgear card appears to be fine at this point.  I'll do some more work on the atl1e data, which does appear to still be corrupting.
Comment 8 Bill McGonigle 2009-06-09 01:33:53 EDT
As well as regularly providing misleading information on this bug, I seem to be chasing this deeper down a rabbit hole of unlikely problems.

Turns out the flash drive isn't bad, but writes to the flash drive are unreliable on the same machine that has the atl1e corruption.  SATA hard disk access seems to be perfect, but SATA optical drive access appears flaky.

I could either be having multiple independent problems causing data corruption, or there could be a common cause, such as everything being on an integrated chipset (I don't really know the architecture of the machine).  Gene, would you have a few minutes to test out USB write operations to compare notes?

Running this on my machine (the USB drive was on sdb):

dd if=/dev/urandom of=/dev/sdb bs=8M ; dd if=/dev/sdb of=random-kingston-1.dd bs=8M; dd if=random-kingston-1.dd of=/dev/sdb bs=8M; dd if=/dev/sdb of=random-kingston-2.dd bs=8M; sha256sum random-kingston-* > random-kingstons.sha256; cat random-kingstons.sha256

I get consistently identical checksums on my old laptop (macbook pro) running F11 i686 and consistently different results on my ASUS P5Q Pro-based desktop running F11 x86_64.  I was working with both 4GB and 8GB flash drives, one from Kingston and one from PNY, and they both exhibit the same problem on this machine.  Repeatedly only reading the same data from the flash drive appears to be consistent.  I see the same thing whether I'm writing to the raw device or to a file on a vfat filesystem on the device.  It's during this operation that the pattern I mentioned in comment #4 is manifested.
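
A simpler variant of this test, sketched under assumptions: instead of the four-step dd sequence on the raw device, it writes random data to a file (for example on the mounted flash drive), hashes what was written, then reads the file back and hashes what comes out. The function name and parameters are hypothetical. Note that a read immediately after a write may be served from the page cache rather than from the device, so for a meaningful result the drive would likely need to be unmounted and remounted (or the data made much larger than RAM) between the two phases.

```python
import hashlib
import os
from typing import Tuple


def write_then_read_digests(path: str, total_bytes: int,
                            chunk_size: int = 1 << 20) -> Tuple[str, str]:
    """Write random data to path, then read it back; return both SHA-256 digests."""
    written = hashlib.sha256()
    with open(path, "wb") as handle:
        remaining = total_bytes
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            written.update(chunk)
            handle.write(chunk)
            remaining -= len(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # ask the kernel to push the data to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            read_back.update(chunk)
    return written.hexdigest(), read_back.hexdigest()
```

Differing digests would point at the write or read path on that machine; identical digests on one run do not rule out an intermittent fault.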

Booting off an f10 i686 LiveCD I'm not seeing this.  I am seeing it on the installed f11 x86_64 and an f11 i686 LiveCD, same as the network problem.  I've got the eeePC working on the same test, but it seems to be running at USB 1 speeds on the 8GB drive, so it'll be done sometime tomorrow.
Comment 9 Bug Zapper 2009-06-09 12:51:53 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 10 Bill McGonigle 2009-06-09 18:29:51 EDT
The USB problem is not seen on the eeePC, but the network corruption is.  When sending files with netcat, corruption is seen when the atl1e is receiving the bulk of the data, but not (in my limited testing) when it's sending the bulk of the data.  The corruption seems to be sneaking past any TCP checksumming.
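
As background on why a checksum alone is not a strong guarantee: the TCP checksum is a 16-bit one's-complement sum (its computation is described in RFC 1071), and some changes leave that sum untouched, for example two 16-bit words trading places. The sketch below demonstrates only that property. It is not a diagnosis; nothing in this report establishes how the corrupted data gets past the checksum, and the sample bytes are arbitrary.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data, as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Arbitrary sample: the same two 16-bit words in a different order.
original = bytes.fromhex("69b5c6e4")
reordered = bytes.fromhex("c6e469b5")

same_checksum = internet_checksum(original) == internet_checksum(reordered)
```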

Attachments with cmp's of a 3GB random file on two hardware platforms to follow.
Comment 11 Bill McGonigle 2009-06-09 18:31:30 EDT
Created attachment 347106 [details]
difference between known good file and atl1e-received file (eeePC 1000HE)
Comment 12 Bill McGonigle 2009-06-09 18:33:36 EDT
Created attachment 347107 [details]
difference between known-good file and atl1e-transferred file (P5Q Pro)
Comment 13 Ville Törhönen 2009-07-01 13:06:44 EDT
Created attachment 22168


I am experiencing this same problem on my hardware:

ASUS P5QL Pro 
Intel Core 2 Duo E6600
4GB RAM
Fedora 11, running 2.6.29.5-191.fc11.x86_64

I've managed to reproduce this error by logging in to the computer over SSH, and
by transferring files to the computer with SCP.

This is definitely a problem with the atl1e driver.
Comment 14 Bill McGonigle 2009-08-17 20:48:07 EDT
The P5Q problems seem to have been related to the BIOS undervolting the memory.  According to the ASUS forum, manually setting the voltage higher than spec is required for the board to actually provide the required voltage.
Comment 15 Bug Zapper 2010-04-27 10:35:47 EDT
This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 16 Bill McGonigle 2010-04-27 12:12:02 EDT
This still exists on f12, at least on an eeePC 1000HE.  The Ubuntu guys have a similar open bug, no fix but to turn off offload on machines that support it (many don't).
Comment 17 Bug Zapper 2010-06-28 08:46:08 EDT
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
Comment 18 Bill McGonigle 2010-06-28 08:52:09 EDT
(In reply to comment #16)
> This still exists on f12, at least on an eeePC 1000HE.  The Ubuntu guys have a
> similar open bug, no fix but to turn off offload on machines that support it
> (many don't).    

reporter or maintainer: please bump version.
