Description of problem:

After upgrade to Fedora 16, one of my NICs apparently stopped working. After much investigation, I tracked it down to the offloading options for the NIC.

Problem #1: Packets seem to get corrupted out of the NIC. For instance, I could see arp broadcasts coming in to the NIC from other machines, and see valid arp replies going out, but the receiving other computers would receive corrupted arp packets and list "invalid" in their arp tables, or a MAC of all zeroes.

If I disable the VLAN and put my NIC on an untagged switch port like a normal non-VLAN computer, there is no problem. All works fine. So the problem is VLAN-specific.

I eventually tried turning off rxvlan offloading on my NIC using ethtool:

    ethtool -K eth1 rxvlan off

That appears to turn off both rxvlan and txvlan automatically. Magically the problem then goes away and all works fine.

Before this I was just letting ethtool/etc default those values and all worked fine.

This bug did not exist in Fedora 15 (kernel-PAE-2.6.40.6-0.fc15.i686).
This bug exists in Fedora 16 (kernel-PAE-3.1.1-1.fc16.i686).

Problem #2: After fixing the above bug, another bug hit. Intermittent packets would get corrupted/dropped/whatever and I'd get the error by the thousands:

    Nov 15 20:09:09 pog kernel: [10155.006376] e1000 0000:05:00.0: eth1: checksum_partial proto=81!

Disabling tx offloading fixed this problem. Again, the ethtool offload defaults used to all work fine, never a problem.

Now my ifcfg-eth1 file has a line like:

    ETHTOOL_OPTS='-K eth1 rxvlan off tx off'

Now it all seems stable and working.

The NIC is e1000 on PCI card (not express):

    05:00.0 Ethernet controller: Intel Corporation 82545GM Gigabit Ethernet Controller (rev 04)

I think something was majorly broken in Intel NIC e1000 offloading between kernel 2.x and 3.x.

Version-Release number of selected component (if applicable):
3.1.1-1.fc16.i686.PAE

How reproducible:
Always

Steps to Reproduce:
1. setup a tagged vlan on the nic
2. leave offload at defaults (rxvlan on txvlan on tx on)
3. make some traffic out of the nic, like ping

Actual results:
no packets get through properly (problem #1) and/or corruption (problem #2)

Expected results:
working nic with valid packets traversing

Additional info:
This also appears to be happening with the latest F15 kernel 2.6.41.1-1.fc15.x86_64. The offloading solution provided above also fixes the problem with this kernel. In my case, I'm running VLANs on top of a bonded connection through a dual port Intel 82546EB Gigabit card using the e1000 driver. The existing setup worked fine with both kernel-2.6.38.6-26.rc1.fc15.x86_64 and kernel-2.6.40.6-0.fc15.x86_64 so something in the latest push is the culprit.
OK. 2.6.40 is really 3.0, so something seems to have changed between 3.0 and 3.1 to cause this.
Sorry that you are having an issue with the card. I am looking into it. While I am investigating, would you be willing to try the standalone e1000 driver from SourceForge and see if the issue reproduces?
I can't seem to make the standalone e1000. Maybe it doesn't support kernel 3.1? Should I try an unstable version?

make -C /lib/modules/3.1.1-1.fc16.i686.PAE/build SUBDIRS=/tmp/e1000-8.0.35/src modules
make[1]: Entering directory `/usr/src/kernels/3.1.1-1.fc16.i686.PAE'
  CC [M]  /tmp/e1000-8.0.35/src/e1000_main.o
/tmp/e1000-8.0.35/src/e1000_main.c: In function ‘e1000_update_mng_vlan’:
/tmp/e1000-8.0.35/src/e1000_main.c:376:3: error: implicit declaration of function ‘vlan_group_get_device’ [-Werror=implicit-function-declaration]
/tmp/e1000-8.0.35/src/e1000_main.c: At top level:
/tmp/e1000-8.0.35/src/e1000_main.c:702:2: error: unknown field ‘ndo_vlan_rx_register’ specified in initializer
/tmp/e1000-8.0.35/src/e1000_main.c:702:2: warning: initialization from incompatible pointer type [enabled by default]
/tmp/e1000-8.0.35/src/e1000_main.c:702:2: warning: (near initialization for ‘e1000_netdev_ops.ndo_do_ioctl’) [enabled by default]
/tmp/e1000-8.0.35/src/e1000_main.c: In function ‘e1000_receive_skb’:
/tmp/e1000-8.0.35/src/e1000_main.c:3725:3: error: implicit declaration of function ‘vlan_gro_receive’ [-Werror=implicit-function-declaration]
/tmp/e1000-8.0.35/src/e1000_main.c: In function ‘e1000_vlan_rx_kill_vid’:
/tmp/e1000-8.0.35/src/e1000_main.c:4606:2: error: implicit declaration of function ‘vlan_group_set_device’ [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
make[2]: *** [/tmp/e1000-8.0.35/src/e1000_main.o] Error 1
make[1]: *** [_module_/tmp/e1000-8.0.35/src] Error 2
make[1]: Leaving directory `/usr/src/kernels/3.1.1-1.fc16.i686.PAE'
You are probably right. I have not tested it with kernel 3.1 yet. I am going to reproduce the issue in the lab today and will send an update. (FYI: this is possible because the out-of-tree e1000 driver is EOL and we only support the in-kernel e1000 driver. e1000-8.0.35 was probably last tested with kernel 3.0.)
A quick look into e1000e and igb reveals that the e1000 driver doesn't have logic to look into the VLAN header to pick out the encapsulated protocol. Would you please try the following patch?

Signed-off-by: Tushar Dave <tushar.n.dave>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index cf480b5..f087771 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -2777,10 +2777,16 @@ static bool e1000_tx_csum(struct e1000_adapter *adapter,
 	unsigned int i;
 	u8 css;
 	u32 cmd_len = E1000_TXD_CMD_DEXT;
+	__be16 protocol;

 	if (skb->ip_summed != CHECKSUM_PARTIAL)
 		return false;

+	if (skb->protocol == cpu_to_be16(ETH_P_8021Q))
+		protocol = vlan_eth_hdr(skb)->h_vlan_encapsulated_proto;
+	else
+		protocol = skb->protocol;
+
 	switch (skb->protocol) {
 	case cpu_to_be16(ETH_P_IP):
 		if (ip_hdr(skb)->protocol == IPPROTO_TCP)
@@ -2794,7 +2800,7 @@ static bool e1000_tx_csum(struct e1000_adapter *adapter,
 	default:
 		if (unlikely(net_ratelimit()))
 			e_warn(drv, "checksum_partial proto=%x!\n",
-			       skb->protocol);
+			       be16_to_cpu(protocol));
 		break;
 	}
Created attachment 538885 [details] e1000_main.c.patch
Any luck with the patch?
Sorry, will try it soon. No time lately, and it takes a lot of effort to get the production system to a state where I can rmmod and insmod on its main LAN adapter.
[mass update] kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. Please retest with this update.
No change. I still can't access VLANed connections to switches until I execute a rxvlan off and tx off on each bond and ethx interface with ethtool. ssh links give no route to host. After I do that, connections seem fine.
maybe this is related to https://sourceforge.net/tracker/?func=detail&atid=447449&aid=3495944&group_id=42302
I normally run with MTU of 9000 which would be over the limit noted there. Dropping back to 1500 didn't affect the problem.
Did you try the patch I mentioned in comment #6?
No. I did not. I am in the position of the original reporter. The box this is running on is the primary DNS server for the company. I can't drop it to test patches. I had assumed since we were requested to test this kernel by Dave that the patch was in since we hadn't been asked to try out any of the intervening kernels. I would think this would be easy enough to set up on your end and make sure it works and apply the patch to the kernel. We aren't doing anything magic.
Sorry for the delay. My MTU is 1518, I guess I have it higher than 1500 for VLAN tagging overhead. I used to run 9000 but it gave me far worse LAN performance than 1500(!!!) so I switched back. Never did solve why (Linksys web-smart Gb switch, all Intel server or high-end DT NICs). I will try the patch in comment #6 to see if I can now compile.
Sorry, noob problems. I don't do much kernel building. I know how to patch & compile the whole kernel, but I'm wondering if I can just build/make the e1000 directory without doing the whole massive kernel. The makefile in e1000 seems deficient (I can't figure out what target to make to make it do anything but error out). Any tips? I'd love to just compile the e1000 stuff and rmmod/insmod it rather than sitting here 2 hours for the kernel to build. Thanks!
We have some extremely large file moves here moving server backups to other buildings on a daily basis. I'm not sure the 1500/9000 is the only help compared to increasing buffer size and some other tweaks, but it does seem to help the transfer rate for those cases. Just be sure MTU probing is enabled if you run a mixed network. We ran 63.75 MB/sec on a 50 GB transfer with normal network loads. Not as fast as you could go if you could afford faster than gigabit on copper connections, but it's better than sneaker net.
(In reply to comment #19)
> Sorry, noob problems. I don't do much kernel building. I know how to patch &
> compile the whole kernel, but I'm wondering if I can just build/make the e1000
> directory without doing the whole massive kernel.
> The makefile in e1000 seems deficient (I can't figure out what target to make
> to make it do anything but error out).
> Any tips? I'd love to just compile the e1000 stuff and rmmod/insmod it rather
> than sitting here 2 hours for the kernel to build.
> Thanks!

To compile only the e1000 module inside the Linux kernel source tree, you can do:

    # cd linux_src_directory
    # make M=drivers/net/ethernet/intel/e1000/

Let me know if you need help.
(In reply to comment #17)
> No. I did not. I am in the position of the original reporter. The box this is
> running on is the primary DNS server for the company. I can't drop it to test
> patches. I had assumed since we were requested to test this kernel by Dave that
> the patch was in since we hadn't been asked to try out any of the intervening
> kernels.
>
> I would think this would be easy enough to set up on your end and make sure it
> works and apply the patch to the kernel. We aren't doing anything magic.

I have built a kernel binary RPM with the e1000 patch for you so you can install and test it. However, the file is more than 20 MB, so I cannot attach it to BZ. Do you have a place where I can upload the file? Otherwise I need to find a place from which you can download it.
I applied the supplied patch to kernel-3.3.5-2.fc16.x86_64, but it did not fix the problem. This did not surprise me, since if you inspect the patch, it does not appear to do anything meaningful. It calculates a new local "protocol" value, but that value is not used anywhere except in a warning message. Am I missing something? The suggested workaround (ethtool -K eth1 rxvlan off tx off) seems to work. This is a huge bug. Is there anything that can be done to expedite a fix? Thanks, Andy
Is this the correct patch? http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=52f5509fe8ccb607ff9b84ad618f244262336475 It looks like it is fixed in 3.3.6. Is that coming soon for Fedora 16? Thanks, Andy
Sorry, again delayed with a chance to work on this. Due to comment #23 I guess it's not worth testing the patch? Andrew's git patch link does look promising and very relevant. Perhaps this will fix it. Should we just wait and see when 3.3.6 gets pushed to F16?
I just rebuilt 3.3.5-2.fc16.x86_64 with the patch I mentioned above in comment #24. That seems to fix the problem. It would be great to get a patched kernel pushed out for Fedora 16, since I really prefer not to maintain my own kernel versions. Thanks, Andy
(In reply to comment #26)
> I just rebuilt 3.3.5-2.fc16.x86_64 with the patch I mentioned above in comment
> #24. That seems to fix the problem.
>
> It would be great to get a patched kernel pushed out for Fedora 16, since I
> really prefer not to maintain my own kernel versions.

3.3.6 has already been committed to Fedora. It's building here:

http://koji.fedoraproject.org/koji/buildinfo?buildID=318861
(In reply to comment #26)
> I just rebuilt 3.3.5-2.fc16.x86_64 with the patch I mentioned above in comment
> #24. That seems to fix the problem.
> It would be great to get a patched kernel pushed out for Fedora 16, since I
> really prefer not to maintain my own kernel versions.
> Thanks,
> Andy

Andy,
Thanks for testing this.
FYI, I installed the 3.3.6 kernel from comment #27, and everything seems to work fine so far. Thanks, Andy
kernel-3.3.6-3.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/kernel-3.3.6-3.fc16
Package kernel-3.3.6-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.

Update it with:

    # su -c 'yum update --enablerepo=updates-testing kernel-3.3.6-3.fc16'

as soon as you are able to, then reboot.

Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-8074/kernel-3.3.6-3.fc16
then log in and leave karma (feedback).
kernel-3.3.6-3.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.