Bug 619070
Summary: | 802.3ad link aggregation won't work with newer (2.6.18-194.8.1.el5) kernel and ixgbe driver | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Doug Wandell <doug> |
Component: | kernel | Assignee: | Andy Gospodarek <agospoda> |
Status: | CLOSED ERRATA | QA Contact: | Network QE <network-qe> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.5 | CC: | alexander.h.duyck, andriusb, arozansk, bandan.das, bluca, cevich, chas.horvath, cward, cww, dan.duval, dhoward, greg.procunier, hjia, jesse.brandeburg, john.ronciak, jonathansturges, jparadis, jpirko, peterm, robert.evans, syeghiay |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Previously, 802.3ad link aggregation did not work properly with the ixgbe driver. This was caused by an inability to form 802.3ad-based bonds. With this update, the issue that prevented 802.3ad link aggregation from working properly has been fixed.
|
Last Closed: | 2011-01-13 21:45:34 UTC | Type: | --- |
Bug Blocks: | 644822 | ||
Attachments: |
Description
Doug Wandell
2010-07-28 14:01:41 UTC
Created attachment 435018 [details]
Dmesg tail after issuing 'ifup bond0'
Created attachment 435019 [details]
New kernel dmesg tail after issuing 'ifup bond0'
I am also getting this exact same problem trying to enable mode=4 (802.3ad link aggregation).

cat /proc/net/bonding/bond0
http://img839.imageshack.us/img839/2317/8023adproblem.jpg

service network stop
http://img412.imageshack.us/img412/8517/netstop.jpg

service network start
http://img96.imageshack.us/img96/1560/netstart.jpg

Running 2.6.18-194.el5 RHEL5-u5 x86_64 smp kernel.

I updated to the following driver from Intel and my trunking issue was fixed:

# modinfo ixgbe
filename: /lib/modules/2.6.18-194.8.1.el5/kernel/drivers/net/ixgbe/ixgbe.ko
version: 2.0.84.11-NAPI
license: GPL
description: Intel(R) 10 Gigabit PCI Express Network Driver
author: Intel Corporation, <linux.nics>
srcversion: 4E1775748069A498875EA2E

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 2
    Actor Key: 9
    Partner Key: 2
    Partner Mac Address: 00:24:98:ed:2a:80

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:66:a0:dc
Aggregator ID: 1

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:21:66:a4:04
Aggregator ID: 1

Conclusion: the problem exists in the ixgbe driver that Red Hat bundles with its kernel.

I realize it seems like this is something that is fixed with the latest ixgbe driver from Intel, but this appears to be way too much like bug 567604 for me to discredit this patch as a fix:

http://people.redhat.com/agospoda/rhel5/0208-bonding-Fix-updating-of-speed-duplex-changes.patch

My test kernels not only contain that patch but an ixgbe update too, so I suspect the issue would be resolved if running those kernels.
They can be found here:

http://people.redhat.com/agospoda/#rhel5

The latest development kernels, which contain only the fix for bug 567604 (as of right now), can be found here:

http://people.redhat.com/jwilson/el5/

If you are able to test both and report the results here I would *really* appreciate it.

Andy, I tested both kernels and neither made a difference for me.

# uname -a
Linux rhev-prod-node6.mitre.org 2.6.18-212.el5.gtest.89 #1 SMP Mon Aug 16 14:01:15 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

# modinfo ixgbe
filename: /lib/modules/2.6.18-212.el5.gtest.89/kernel/drivers/net/ixgbe/ixgbe.ko
version: 2.0.84-k2
...

The /proc/net/bonding/bond0 information looks the same as before:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 150
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 1
    Actor Key: 33
    Partner Key: 1
    Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth8
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:54:f7:5c
Aggregator ID: 1

Slave Interface: eth9
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:54:f7:5d
Aggregator ID: 2

Thanks for testing that, Doug. Sorry neither kernel worked for you. I think I see what's wrong and will test and post a patch shortly.

It turns out the problem I thought I saw was not really a problem. I don't have a switch for testing, but I can confirm with another system that the 802.3ad bonds are not getting set up correctly. It looks like something is incorrect with the packet-split code in the driver, as skb->data is wrong when the 802.3ad bonding code begins to inspect it. I've even found that if I clear IXGBE_PSRTYPE_L2HDR from psrtype, everything works, so I'm looking around for changes in rx buffer setup that may be causing this.
I think this is probably a bug in the link aggregation code that also needs to be fixed in the upstream kernel. The issue is that bond_3ad_lacpdu_recv/bond_3ad_rx_indication are expecting a linear skb, but the 82599 is splitting the packet at the L2 header and is then placing the LACP data in a separate page. If you add a call to skb_linearize in bond_3ad_lacpdu_recv before you call bond_3ad_rx_indication, it should resolve the issue.

I would tend to agree that a call to skb_linearize would resolve this. The fact that the reporter tested ixgbe-2.0.84.11 on RHEL5.4 and RHEL5.5 and it worked makes me wonder if we've got something wrong in our backport.

Created attachment 446309 [details]
possible missing fix from ixgbe
Andy, does your code have this patch or similar?
Instead of calling linearize, what about just pulling the bytes into skb->data using pskb_pull or skb_pull or whatever call?

That should work too. You could just pull sizeof(struct lacpdu). The issue is that the L2 header split was added to support FCoE, but it is going to expose any protocol handlers that don't correctly handle non-linear frames.

Jesse, we do have the patch added in comment #13. After some more testing it seems the SF driver (2.0.84.11) doesn't actually enable packet split, so that explains things. :)

Andy, do you want to submit the upstream patch for this or should I? Basically all that needs to be done is to add the following snippet to bond_3ad_lacpdu_recv before it grabs the bond->lock:

    if (!pskb_may_pull(skb, sizeof(struct lacpdu)))
            goto out;

I don't have the setup here in front of me to test it, so it might be easier for you to reproduce it, verify the fix, and send the patch from your end.

Alexander, I actually tested this against an upstream kernel and did not find it to be broken, so I'd like to figure out why everything works there before posting a fix upstream. I tested something similar to what is posted in comment #18 on RHEL5.6 development kernels and, as you suspected, linearizing the skb resolved the issue.

(In reply to comment #19)
> Alexander, I actually tested this against an upstream kernel and did not find
> it to be broken, so I'd like to figure out why everything works there before
> posting a fix upstream.
>

I did a bit more snooping around and my upstream kernel on this system was old enough that it did not have this fix:

commit 486545216472d67c16e3d3d60c5f21f60959c855
Author: Alexander Duyck <alexander.h.duyck>
Date: Thu Aug 19 13:36:27 2010 +0000

    ixgbe: pull PSRTYPE configuration into a separate function

so packet split was only enabled on queue 0 and there seemed to be some cases where the frames were being received on a queue other than queue 0 (at least that had to be the case for it to work in the past).
I definitely agree that we need a solution to handle the non-linear frames now coming out of the ixgbe interfaces upstream. (I didn't doubt Alex before, but I wanted to be sure I understood why my upstream stuff appeared to be working.) However, I question whether this is the correct solution for RHEL5, as I could just turn off packet split for L2 frames since we aren't supporting FCoE on RHEL5.

Created attachment 446535 [details]
bonding-correct-LACPDUs-that-are-in-non-linear-skbs.patch
This will likely be the patch to add to RHEL5 and upstream.
Thanks to Intel for the help on this.
patch posted upstream:
http://marc.info/?l=linux-netdev&m=128415190309022&w=2

My test kernels have been updated to include a patch for this bugzilla. Please test them and report back your results.

http://people.redhat.com/agospoda/#rhel5

Without immediate feedback there is a good chance this or any other fix for this driver will not be included in the upcoming update.

I've been having fits trying to get LACP to work using some 82599EB cards and Cisco Nexus 5010 switches in recent weeks. I'd tried the stock RHEL5.5 ixgbe (2.0.44-k2) and also a 2.1.4 driver from Intel with no luck.

Though I've just begun testing the kernels linked to in comment 23, they do indeed seem to fix the problem!

# lspci | grep 82599:
08:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
08:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
09:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
09:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)

# uname -a
Linux test_node 2.6.18-223.el5.gtest.90 #1 SMP Thu Sep 23 11:18:30 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 2
    Actor Key: 33
    Partner Key: 32970
    Partner Mac Address: 00:23:04:ee:be:03

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:0e:cd:ee
Aggregator ID: 1

Slave Interface: eth4
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:e0:ed:19:47:60
Aggregator ID: 1

(In reply to comment #24)
> I've been having fits trying to get LACP to work using some 82599EB cards and
> Cisco Nexus 5010 switches in recent weeks. I'd tried the stock RHEL5.5 ixgbe
> (2.0.44-k2) and also a 2.1.4 driver from Intel with no luck.
> Though I've just begun testing the kernels linked to in comment 23, they do
> indeed seem to fix the problem!
>

Glad to hear it! The problem was actually in the bonding driver, not the ixgbe driver, so until Intel puts a workaround in their SourceForge driver, it will still be broken. I'll see what I can do to get this into RHEL5.6 (no promises though).

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Stratus has encountered this problem also.

in kernel-2.6.18-229.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

(In reply to comment #34)
> in kernel-2.6.18-229.el5
> You can download this test kernel (or newer) from
> http://people.redhat.com/jwilson/el5
>
> Detailed testing feedback is always welcomed.

I was stumped today by a possibly related problem. My environment is based on NX3031 NICs (netxen_nic driver) with bonding in active-backup mode. At system boot the bond is apparently created but no traffic comes through. Putting the interface in promiscuous mode (e.g. with tcpdump) makes it work again until promiscuous mode is disabled; issuing 'service network restart' also fixes it.

I tried updating to the kernel from http://people.redhat.com/jwilson/el5/231.el5/ and the problem seems solved for good.

L.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, 802.3ad link aggregation did not work properly with the ixgbe driver. This was caused by an inability to form 802.3ad-based bonds. With this update, the issue that prevented 802.3ad link aggregation from working properly has been fixed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html