Bug 754589 - e1000 with default offloading on tagged vlan corrupts packets / checksum_partial proto=81
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: i686
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Whiteboard: rebaseregression
Reported: 2011-11-16 22:14 UTC by Trevor Cordes
Modified: 2012-05-22 02:22 UTC (History)

Doc Type: Bug Fix
Last Closed: 2012-05-22 02:22:25 UTC

Attachments
e1000_main.c.patch (967 bytes, patch)
2011-12-01 02:09 UTC, Tushar Dave

Description Trevor Cordes 2011-11-16 22:14:48 UTC
Description of problem:
After upgrade to Fedora 16, one of my NICs apparently stopped working.  After much investigation, I tracked it down to the offloading options for the NIC.

Problem #1: Packets seem to get corrupted out of the NIC.  For instance, I could see arp broadcasts coming in to the NIC from other machines, and see valid arp replies going out, but the receiving other computers would receive corrupted arp packets and list "invalid" in their arp tables, or a MAC of all zeroes.

If I disable the VLAN and put my NIC on an untagged switch port like a normal non-VLAN computer, there is no problem.  All works fine.  So the problem is VLAN-specific.

I eventually tried turning off rxvlan offloading on my NIC using ethtool:
ethtool -K eth1 rxvlan off
That appears to turn off both rxvlan and txvlan automatically.  Magically the problem then goes away and all works fine.

Before this I was just letting ethtool/etc default those values and all worked fine.  This bug did not exist in Fedora 15 (kernel-PAE-2.6.40.6-0.fc15.i686).  This bug exists in Fedora 16 (kernel-PAE-3.1.1-1.fc16.i686).

Problem #2:
After fixing the above bug, another bug hit.  Intermittent packets would get corrupted/dropped/whatever and I'd get the error by the thousands:
Nov 15 20:09:09 pog kernel: [10155.006376] e1000 0000:05:00.0: eth1: checksum_partial proto=81!

Disabling tx offloading fixed this problem.  Again, the ethtool offload defaults used to all work fine, never a problem.

Now my ifcfg-eth1 file has a line like:
ETHTOOL_OPTS='-K eth1 rxvlan off tx off'

Now it all seems stable and working.

The NIC is e1000 on PCI card (not express):
05:00.0 Ethernet controller: Intel Corporation 82545GM Gigabit Ethernet Controller (rev 04)

I think something was majorly broken in Intel NIC e1000 offloading between kernel 2.x and 3.x.


Version-Release number of selected component (if applicable):
3.1.1-1.fc16.i686.PAE

How reproducible:
Always

Steps to Reproduce:
1. Set up a tagged VLAN on the NIC
2. Leave the offload settings at their defaults (rxvlan on, txvlan on, tx on)
3. Generate some traffic out of the NIC, e.g. with ping
  
Actual results:
no packets get through properly (problem #1) and/or corruption (problem #2)

Expected results:
working nic with valid packets traversing

Additional info:

Comment 1 William H. Haller 2011-11-21 18:46:15 UTC
This also appears to be happening with the latest F15 kernel 2.6.41.1-1.fc15.x86_64. The offloading solution provided above also fixes the problem with this kernel.

In my case, I'm running VLANs on top of a bonded connection through a dual port Intel 82546EB Gigabit card using the e1000 driver.

The existing setup worked fine with both kernel-2.6.38.6-26.rc1.fc15.x86_64 and kernel-2.6.40.6-0.fc15.x86_64 so something in the latest push is the culprit.

Comment 2 Josh Boyer 2011-11-21 19:13:41 UTC
OK.  2.6.40 is really 3.0, so something seems to have changed between 3.0 and 3.1 to cause this.

Comment 3 Tushar Dave 2011-11-28 19:48:32 UTC
Sorry that you are having issues with the card. I am looking into it.
While I am investigating, would you be willing to try the standalone e1000 driver from SourceForge and see if the issue reproduces?

Comment 4 Trevor Cordes 2011-11-29 14:24:58 UTC
I can't seem to make the standalone e1000.  Maybe it doesn't support kernel 3.1?  Should I try an unstable version?

make -C /lib/modules/3.1.1-1.fc16.i686.PAE/build SUBDIRS=/tmp/e1000-8.0.35/src modules
make[1]: Entering directory `/usr/src/kernels/3.1.1-1.fc16.i686.PAE'
  CC [M]  /tmp/e1000-8.0.35/src/e1000_main.o
/tmp/e1000-8.0.35/src/e1000_main.c: In function ‘e1000_update_mng_vlan’:
/tmp/e1000-8.0.35/src/e1000_main.c:376:3: error: implicit declaration of function ‘vlan_group_get_device’ [-Werror=implicit-function-declaration]
/tmp/e1000-8.0.35/src/e1000_main.c: At top level:
/tmp/e1000-8.0.35/src/e1000_main.c:702:2: error: unknown field ‘ndo_vlan_rx_register’ specified in initializer
/tmp/e1000-8.0.35/src/e1000_main.c:702:2: warning: initialization from incompatible pointer type [enabled by default]
/tmp/e1000-8.0.35/src/e1000_main.c:702:2: warning: (near initialization for ‘e1000_netdev_ops.ndo_do_ioctl’) [enabled by default]
/tmp/e1000-8.0.35/src/e1000_main.c: In function ‘e1000_receive_skb’:
/tmp/e1000-8.0.35/src/e1000_main.c:3725:3: error: implicit declaration of function ‘vlan_gro_receive’ [-Werror=implicit-function-declaration]
/tmp/e1000-8.0.35/src/e1000_main.c: In function ‘e1000_vlan_rx_kill_vid’:
/tmp/e1000-8.0.35/src/e1000_main.c:4606:2: error: implicit declaration of function ‘vlan_group_set_device’ [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors

make[2]: *** [/tmp/e1000-8.0.35/src/e1000_main.o] Error 1
make[1]: *** [_module_/tmp/e1000-8.0.35/src] Error 2
make[1]: Leaving directory `/usr/src/kernels/3.1.1-1.fc16.i686.PAE'

Comment 5 Tushar Dave 2011-11-29 17:21:08 UTC
You are probably right. I have not tested it with kernel 3.1 yet. I am going to reproduce the issue in the lab today and will send an update.

(FYI: this is possible because we have EOL'd the out-of-tree e1000 driver and only support the in-kernel e1000 driver. e1000-8.0.35 was probably last tested with kernel 3.0.)

Comment 6 Tushar Dave 2011-12-01 02:08:24 UTC
A quick look into e1000e and igb reveals that, unlike those drivers, the e1000 driver doesn't have logic to look into the VLAN header to pick out the encapsulated protocol.
Would you please try the following patch?


Signed-off-by: Tushar Dave <tushar.n.dave>
---

 drivers/net/ethernet/intel/e1000/e1000_main.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index cf480b5..f087771 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -2777,10 +2777,16 @@ static bool e1000_tx_csum(struct e1000_adapter *adapter,
 	unsigned int i;
 	u8 css;
 	u32 cmd_len = E1000_TXD_CMD_DEXT;
+	__be16 protocol;
 
 	if (skb->ip_summed != CHECKSUM_PARTIAL)
 		return false;
 
+	if (skb->protocol == cpu_to_be16(ETH_P_8021Q))
+		protocol = vlan_eth_hdr(skb)->h_vlan_encapsulated_proto;
+	else
+		protocol = skb->protocol;
+
 	switch (skb->protocol) {
 	case cpu_to_be16(ETH_P_IP):
 		if (ip_hdr(skb)->protocol == IPPROTO_TCP)
@@ -2794,7 +2800,7 @@ static bool e1000_tx_csum(struct e1000_adapter *adapter,
 	default:
 		if (unlikely(net_ratelimit()))
 			e_warn(drv, "checksum_partial proto=%x!\n",
-			       skb->protocol);
+			       be16_to_cpu(protocol));
 		break;
 	}
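
As a side note on the last hunk: the old warning passed `skb->protocol`, a network-byte-order value, straight to `%x`, while the patch wraps it in `be16_to_cpu()`. That is a plausible explanation for the odd `proto=81` in the original log, since the 802.1Q ethertype is 0x8100. The following is a minimal userspace sketch of that effect, not driver code; `load_raw16`, `load_be16` and `format_hex16` are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read a 16-bit field exactly as it sits in memory, with no byte swap.
 * This mimics printing a __be16 value without converting it first. */
uint16_t load_raw16(const uint8_t *p)
{
	uint16_t v;

	assert(p != NULL);
	memcpy(&v, p, sizeof(v));
	return v;
}

/* Portable stand-in for be16_to_cpu(): big-endian bytes to a host value. */
uint16_t load_be16(const uint8_t *p)
{
	assert(p != NULL);
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Format a 16-bit value the way a "%x" in a log message would. */
void format_hex16(uint16_t v, char *out, size_t n)
{
	assert(out != NULL && n > 0);
	snprintf(out, n, "%x", (unsigned int)v);
}
```

For the on-wire bytes 0x81 0x00, `load_be16()` gives 0x8100 on any host. On a little-endian machine such as the i686 box in this report, `load_raw16()` gives 0x0081, which `%x` prints as `81`.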

Comment 7 Tushar Dave 2011-12-01 02:09:50 UTC
Created attachment 538885 [details]
e1000_main.c.patch

Comment 8 Tushar Dave 2011-12-05 19:03:30 UTC
Any luck with the patch?

Comment 9 Trevor Cordes 2011-12-14 11:37:52 UTC
Sorry, I will try it soon.  I have had no time lately, and it takes a lot of effort to get the production system to the state where I can rmmod and insmod on its main LAN adapter.

Comment 10 Dave Jones 2012-03-22 16:58:32 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 13 William H. Haller 2012-03-22 18:38:35 UTC
No change. I still can't access VLANed connections to switches until I execute a rxvlan off and tx off on each bond and ethx interface with ethtool. ssh links give no route to host. After I do that, connections seem fine.

Comment 14 Jesse Brandeburg 2012-03-22 18:50:05 UTC
maybe this is related to 
https://sourceforge.net/tracker/?func=detail&atid=447449&aid=3495944&group_id=42302

Comment 15 William H. Haller 2012-03-22 19:09:29 UTC
I normally run with MTU of 9000 which would be over the limit noted there. Dropping back to 1500 didn't affect the problem.

Comment 16 Tushar Dave 2012-03-23 01:26:27 UTC
Did you try the patch I mentioned in comment #6?

Comment 17 William H. Haller 2012-03-23 14:12:55 UTC
No. I did not. I am in the position of the original reporter. The box this is running on is the primary DNS server for the company. I can't drop it to test patches. I had assumed since we were requested to test this kernel by Dave that the patch was in since we hadn't been asked to try out any of the intervening kernels.

I would think this would be easy enough to set up on your end and make sure it works and apply the patch to the kernel. We aren't doing anything magic.

Comment 18 Trevor Cordes 2012-03-28 13:20:49 UTC
Sorry for the delay.

My MTU is 1518, I guess I have it higher than 1500 for VLAN tagging overhead.  I used to run 9000 but it gave me far worse LAN performance than 1500(!!!) so I switched back.  Never did solve why (Linksys web-smart Gb switch, all Intel server or high-end DT NICs).

I will try the patch in comment #6 to see if I can now compile.

Comment 19 Trevor Cordes 2012-03-28 14:19:22 UTC
Sorry, noob problems.  I don't do much kernel building.  I know how to patch & compile the whole kernel, but I'm wondering if I can just build/make the e1000 directory without doing the whole massive kernel.

The makefile in e1000 seems deficient (I can't figure out what target to make to make it do anything but error out).

Any tips?  I'd love to just compile the e1000 stuff and rmmod/insmod it rather than sitting here 2 hours for the kernel to build.

Thanks!

Comment 20 William H. Haller 2012-03-28 14:41:07 UTC
We have some extremely large file moves here moving server backups to other buildings on a daily basis. I'm not sure the 1500/9000 is the only help compared to increasing buffer size and some other tweaks, but it does seem to help the transfer rate for those cases.

Just be sure MTU probing is enabled if you run a mixed network.

We ran 63.75 MB/sec on a 50 GB transfer with normal network loads. Not as fast as you could go if you could afford faster than gigabit on copper connections, but it's better than sneaker net.

Comment 21 Tushar Dave 2012-03-28 16:36:39 UTC
(In reply to comment #19)
> Sorry, noob problems.  I don't do much kernel building.  I know how to patch &
> compile the whole kernel, but I'm wondering if I can just build/make the e1000
> directory without doing the whole massive kernel.
> The makefile in e1000 seems deficient (I can't figure out what target to make
> to make it do anything but error out).
> Any tips?  I'd love to just compile the e1000 stuff and rmmod/insmod it rather
> than sitting here 2 hours for the kernel to build.
> Thanks!

To compile only the e1000 module inside the Linux kernel source tree, you can do:
# cd linux_src_directory
# make M=drivers/net/ethernet/intel/e1000/

let me know if you need help.

Comment 22 Tushar Dave 2012-04-05 18:43:19 UTC
(In reply to comment #17)
> No. I did not. I am in the position of the original reporter. The box this is
> running on is the primary DNS server for the company. I can't drop it to test
> patches. I had assumed since we were requested to test this kernel by Dave that
> the patch was in since we hadn't been asked to try out any of the intervening
> kernels.
> 
> I would think this would be easy enough to set up on your end and make sure it
> works and apply the patch to the kernel. We aren't doing anything magic.

I have built a kernel binary RPM with the e1000 patch so you can install and test it. However, the file is more than 20 MB, so I cannot attach it to BZ. Do you have a place where I can upload the file? Otherwise I need to find a place from which you can download it.

Comment 23 Andrew J. Schorr 2012-05-14 18:52:14 UTC
I applied the supplied patch to kernel-3.3.5-2.fc16.x86_64, but it did
not fix the problem.

This did not surprise me, since if you inspect the patch, it does not
appear to do anything meaningful.  It calculates a new local "protocol" value,
but that value is not used anywhere except in a warning message.  Am I missing
something?
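
To make that observation concrete, here is a minimal userspace sketch, assuming an in-frame 802.1Q tag. It shows why dispatching on the outer ethertype (0x8100) falls through to the default case, while dispatching on the encapsulated protocol would not. It is not the driver code and not the upstream fix; `get_be16`, `frame_l3_proto` and `can_offload_csum` are hypothetical names, and the constants are redefined locally rather than taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP    0x0800u /* IPv4 ethertype */
#define ETH_P_IPV6  0x86DDu /* IPv6 ethertype */
#define ETH_P_8021Q 0x8100u /* 802.1Q VLAN tag ethertype */
#define ETH_HLEN    14u     /* dst MAC + src MAC + ethertype */
#define VLAN_HLEN   4u      /* TCI + encapsulated ethertype */

/* Read a big-endian 16-bit field into a host-order value. */
uint16_t get_be16(const uint8_t *p)
{
	assert(p != NULL);
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return the layer-3 ethertype of a frame, looking past one 802.1Q tag
 * if present. Returns 0 when the frame is too short to decide. */
uint16_t frame_l3_proto(const uint8_t *frame, size_t len)
{
	uint16_t proto;

	if (frame == NULL || len < ETH_HLEN)
		return 0;

	proto = get_be16(frame + 12);
	if (proto == ETH_P_8021Q) {
		if (len < ETH_HLEN + VLAN_HLEN)
			return 0;
		/* The encapsulated ethertype follows the 2-byte TCI. */
		proto = get_be16(frame + 16);
	}
	return proto;
}

/* Hypothetical stand-in for the switch in a checksum-offload setup path:
 * only IPv4 and IPv6 are handled, everything else hits "default". */
int can_offload_csum(uint16_t proto)
{
	switch (proto) {
	case ETH_P_IP:
	case ETH_P_IPV6:
		return 1;
	default:
		return 0;
	}
}
```

For a tagged IPv4 frame, `can_offload_csum(get_be16(frame + 12))` returns 0 because it sees only 0x8100, whereas `can_offload_csum(frame_l3_proto(frame, len))` returns 1. Computing the inner protocol helps only if the switch actually uses it.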

The suggested workaround (ethtool -K eth1 rxvlan off tx off) seems to work.

This is a huge bug.  Is there anything that can be done to expedite a fix?

Thanks,
Andy

Comment 24 Andrew J. Schorr 2012-05-14 19:19:31 UTC
Is this the correct patch?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=52f5509fe8ccb607ff9b84ad618f244262336475

It looks like it is fixed in 3.3.6.  Is that coming soon for Fedora 16?

Thanks,
Andy

Comment 25 Trevor Cordes 2012-05-14 20:05:56 UTC
Sorry, I have again been delayed in finding a chance to work on this.  Given comment #23, I guess it's not worth testing the patch?

Andrew's git patch link does look promising and very relevant.  Perhaps this will fix it.  Should we just wait and see when 3.3.6 gets pushed to F16?

Comment 26 Andrew J. Schorr 2012-05-14 20:50:27 UTC
I just rebuilt 3.3.5-2.fc16.x86_64 with the patch I mentioned above in comment #24.  That seems to fix the problem.

It would be great to get a patched kernel pushed out for Fedora 16, since I really prefer not to maintain my own kernel versions.

Thanks,
Andy

Comment 27 Josh Boyer 2012-05-14 21:15:59 UTC
(In reply to comment #26)
> I just rebuilt 3.3.5-2.fc16.x86_64 with the patch I mentioned above in comment
> #24.  That seems to fix the problem.
> 
> It would be great to get a patched kernel pushed out for Fedora 16, since I
> really prefer not to maintain my own kernel versions.

3.3.6 has already been committed to Fedora.  It's building here:

http://koji.fedoraproject.org/koji/buildinfo?buildID=318861

Comment 28 Tushar Dave 2012-05-14 22:09:48 UTC
(In reply to comment #26)
> I just rebuilt 3.3.5-2.fc16.x86_64 with the patch I mentioned above in comment
> #24.  That seems to fix the problem.
> It would be great to get a patched kernel pushed out for Fedora 16, since I
> really prefer not to maintain my own kernel versions.
> Thanks,
> Andy

Andy,

Thanks for testing this.

Comment 29 Andrew J. Schorr 2012-05-17 21:02:05 UTC
FYI, I installed the 3.3.6 kernel from comment #27, and everything seems to work fine so far.

Thanks,
Andy

Comment 30 Fedora Update System 2012-05-18 00:11:50 UTC
kernel-3.3.6-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.6-3.fc16

Comment 31 Fedora Update System 2012-05-18 10:38:22 UTC
Package kernel-3.3.6-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.3.6-3.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-8074/kernel-3.3.6-3.fc16
then log in and leave karma (feedback).

Comment 32 Fedora Update System 2012-05-22 02:22:25 UTC
kernel-3.3.6-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

