Description of problem: Adding device support. This is a request from IBM to have Niantic support in 5.4.
FSC requests this for 5.4, too.
*** Bug 475580 has been marked as a duplicate of this bug. ***
Updating PM score.
=Comment: #0=================================================
Emily J. Ratliff <ratliff.com> -
1. Feature Overview:
Feature Id: [201288]
a. Name of Feature: Driver update for Intel 10GB - ixgbe
b. Feature Description: Driver updates to support the Intel 10GB NICs. The drivers are called ixgbe and ixgb.
Additional Comments: We require that the ixgbe driver be updated to support the Intel Niantic (Dorado) 10GB NIC.
2. Feature Details:
Sponsor: xSeries
Architectures: x86 x86_64
Arch Specificity: Purely Common Code
Affects Kernel Modules: Yes
Delivery Mechanism: Direct from community
Category: Kernel
Request Type: Driver - Update Version
d. Upstream Acceptance: Accepted
Sponsor Priority: 1
f. Severity: High
IBM Confidential: yes
Code Contribution: no
g. Component Version Target: 2.6.24
3. Business Case: Future option support of Intel 10GB adapter will be available on several systems and blades. These drivers need to be updated to support the high speed adapters.
4. Primary contact at Red Hat: John Jarvis jjarvis
5. Primary contacts at Partner:
Project Management Contact: Monte Knutson, mknutson.com, 877-894-1495
Technical contact(s): Kevin Stansell, kstansel.com; Chris McDermott, mcdermoc.com
IBM Manager: Julio Alvarez, julioa.com
IBM is signed up to test and provide feedback.
*** This bug has been marked as a duplicate of 472547 ***
Gospo - is this the BZ being used for the wholesale ixgbe driver update in 5.4?
Yep, seems like it.
*** Bug 438523 has been marked as a duplicate of this bug. ***
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel5 Please test them and report back your results. Without immediate feedback there is a good chance this or any other fix for this driver will not be included in the upcoming update.
The driver loads (the IDs are there) on the -70 kernel, but it only passes a few packets before it hangs. We tried both our Spring Fountain NICs (SFP+ type NICs with direct attach cables) and the IBM Windley Key NICs (KX4). They all did the same thing.
Has RH tested with the IBM Windley Key NICs?
(In reply to comment #20)
> The driver loads (ID are there) on -70 kernel but it only passes a few packets
> before it hangs. We tried both our Spring Fountain NIC's (SFP+ type NIC with
> direct attach cables and the IBM Windley Key NICs (KX4). They all did the same
> thing.
I tested this on the lone ixgbe-based NIC that I have locally. It's a dual port CX4 82598 and it seemed to work fine when I ran netperf on it for a while (I don't remember how long) using both msi-x and legacy interrupts without any issue. When you see the 'hang', does the kernel hang or does the network interface just stop processing frames?
I took another look at my backport and noticed a potential problem in the ixgbe_clean_rxonly_many function. I would probably not have seen it myself, since I was using a system with fewer cores than yours and would not have had to deal with the vector overlap. The patch looks like this:
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index ebf3578..5f271e1 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1109,7 +1109,7 @@ static int ixgbe_clean_rxonly_many(struct net_device *netdev, int *budget)
 	/* If all Rx work done, exit the polling mode */
 	if ((work_done < work_to_do) || !netif_running(adapter->netdev)) {
 quit_polling:
-		netif_rx_complete(adapter->netdev);
+		netif_rx_complete(netdev);
 		if (adapter->itr_setting & 1)
 			ixgbe_set_itr_msix(q_vector);
 		if (!test_bit(__IXGBE_DOWN, &adapter->state))
I'll apply that fix to my test kernels and get you some new ones.
(In reply to comment #21)
> has RH tested with the IBM Windley Key NICs ?
Nope, we haven't. In the entire company I believe we have 2 ixgbe-based NICs.
Thanks Andy, we'll test the kernel as soon as you get it generated.
>> has RH tested with the IBM Windley Key NICs ?
> Nope, we haven't. In the entire company I believe we have 2 ixgbe-based NICs.
RH was given 4 of these NICs back in March. They are for the IBM Blade Center systems. On the Engr call yesterday Peter M. reported that testing was under way with them. I guess since it would need this driver that it is not actually under way yet. There was a problem with the actual system(s) but IBM got that resolved. So there are Niantic NICs in Westford.
This enhancement request was evaluated by the full Red Hat Enterprise Linux team for inclusion in a Red Hat Enterprise Linux minor release. As a result of this evaluation, Red Hat has tentatively approved inclusion of this feature in the next Red Hat Enterprise Linux Update minor release. While it is a goal to include this enhancement in the next minor release of Red Hat Enterprise Linux, the enhancement is not yet committed for inclusion in the next minor release pending the next phase of actual code integration and successful Red Hat and partner testing.
in kernel-2.6.18-144.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
Gospo - what version of ixgbe has been committed to 5.4 so far?
Andrius, it's version 2.0.8-k2.
Andy, >BAT on ixgbe failed for Niantic (I tested on Spring Fountain). >Driver is unable to ping. Oplin passed. The Spring Fountain NIC is Niantic on a PCIe board. The same as the Windley Key IBM mezz cards that RH already has. Since Oplin is passing, there is something specific to the backport of the Niantic code. This was on the -144 kernel called out above.
This commit also needs to make it into the 5.4 ixgbe driver.
-----------------------------------
From: netdev-owner.org [netdev-owner.org] On Behalf Of David Miller
Sent: Tuesday, May 19, 2009 2:41 PM
To: Kirsher, Jeffrey T
Cc: netdev.org; Waskiewicz Jr, Peter P
Subject: Re: [net-next-2.6 PATCH 1/3] ixgbe: Add semaphore access for PHY initialization for 82599
From: Jeff Kirsher <jeffrey.t.kirsher>
Date: Tue, 19 May 2009 12:18:34 -0700
> The SFP+ NIC (device id 0x10fb) needs a semaphore to serialize
> PHY access, so our PHY init code must honor that same semaphore.
>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher>
Applied.
I got some cards, and have been testing them today with some interesting results. It seems that arping works fine (broadcast or unicast), but I cannot ping with the device at all. It's really odd. I'm guessing it has something to do with the mac address initialization or something, but right now I have no idea.
That is pretty strange. Have you ever grabbed the ethregs utility from us? It is an application you can build that will dump all the device's registers. http://prdownloads.sf.net/e1000e/ethregs-1.4.1.tar.gz If you could run that, we can compare against the configuration for a working kernel and see what might be misconfigured. I agree your current issue may just be the initialization of the RAR registers. Let's get in contact on Monday and see if we can figure out where your code ended up different from the 2.6.30 driver.
The link above is bad; use http://superb-west.dl.sourceforge.net/sourceforge/e1000/ethregs-1.4.1.tar.gz instead.
I tried to track this down a bit more and have found something interesting. When running in ixgbe_clean_rx_irq() here:
	i = rx_ring->next_to_clean;
	rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
	staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
	rx_buffer_info = &rx_ring->rx_buffer_info[i];
	while (staterr & IXGBE_RXD_STAT_DD) {
		u32 upper_len = 0;
		if (*work_done >= work_to_do)
			break;
		(*work_done)++;
I find that for arp requests and responses the staterr field is valid, but for all other types of traffic staterr looks more like a pointer rather than having a value of 0x83 or 0x3 as one might expect. Very interesting....
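For anyone following along, here is a minimal user-space sketch of the status test in that loop. It is not driver code. The bit values are assumed to match the driver's IXGBE_RXD_STAT_DD and IXGBE_RXD_STAT_EOP definitions and should be checked against ixgbe_type.h; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the driver's IXGBE_RXD_STAT_* bits; verify in ixgbe_type.h. */
#define RXD_STAT_DD  0x01u /* descriptor done: hardware wrote the descriptor back */
#define RXD_STAT_EOP 0x02u /* end of packet */

/* Same test as the while() condition quoted above (hypothetical helper name). */
static bool rx_desc_done(uint32_t staterr)
{
    return (staterr & RXD_STAT_DD) != 0;
}

/* True for a completed single-buffer frame: both DD and EOP are set. */
static bool rx_desc_complete_frame(uint32_t staterr)
{
    const uint32_t want = RXD_STAT_DD | RXD_STAT_EOP;

    return (staterr & want) == want;
}
```

Both 0x3 and 0x83 pass these checks. Note that a garbage value which merely happens to have bit 0 set (for example a pointer-like 0xffff8101) also passes the DD test, so the loop cannot tell a corrupted write-back from a valid one.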
Thanks, Jesse. I'll download that and give it a try. Right now, I'm rebuilding my upstream kernel on that box so I can test with 2.6.30 (which I'm sure will work fine). I should have some results soon.
Created attachment 346108 [details] eth2.tgz Jesse, here's a run of the ethregs utility from an upstream kernel (Linus's tree as of today) vs. the current 5.4 tree.
Created attachment 346115 [details] eth2-regs.tgz That last attachment was incorrect. Here is a correct one with the files needed.
Also the MAC address of the card in use here is: 00:1B:21:37:B7:20
FYI, I have installed an Intel 10Gb Mezz card in a blade in Red Hat's Westford lab, ibm-hs22-01.lab.bos.redhat.com. When I boot the 2.6.18-151.el5.gtest.72 kernel I see what I think is the same behavior Andy described in comment 33. ARP requests show up but I don't see replies from the other host. Regardless, ICMP doesn't work. I don't know if it helps you to have a second place to look at this issue, but you are welcome to the IBM hs22 blade.
That's excellent, Peter. I've been testing something locally and I think I've found a fix after some help from Jesse B at Intel. I'll post a patch and a link to new test kernels when I have something I like.
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/# Please test them and report back your results. Without immediate feedback there is a good chance this or any other fix for this driver will not be included in the upcoming update.
I loaded the gtest.73 kernel on the hs22 in Westford. The Intel 10GbE Mezz card with the ixgbe driver now responds to ICMP. I will do some more testing with it tomorrow. I've run out of time today. Unfortunately the only other 10Gb ethernet card I have access to here in Westford is a qla8xxx card running the qlge driver from mainline. That driver is pretty nascent. I don't have much confidence in it. I wish I could test the ixgbe against something I felt better about.
We did some testing on this ixgbe driver; we found it is mostly functional, but with some remaining bugs.
1. ethtool test failed
2. FC is disabled by default
3. no warnings for old EEPROM
4. unsupported SFP+ detection on Niantic does not seem to work (Intel only supports a limited set of SFP modules)
5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
1) 2) I think both are already patched upstream.
3) is fixed with just a quick check with a printk; it is to help us and users weed out pre-production adapters that might have link issues.
4) Intel will only support a limited set of SFP+ modules; code needs to be in the driver to enforce this correctly.
5) The upstream kernel driver currently has a link problem with 82598 SFP+ as well. It was a community patch (generic MDIO) that broke it, and we are pretty close to fixing it. Once the patch for 82598 SFP+ is pushed upstream, we will send the commit directly to RH to include it asap.
Since we just got final adapter hardware (with new eeprom images and larger eeprom chips) a couple of weeks ago, we have a few more code changes that should be included in RH5.4 to be complete.
*** Bug 438522 has been marked as a duplicate of this bug. ***
*** Bug 438520 has been marked as a duplicate of this bug. ***
Jesse, please confirm this for me:
(In reply to comment #45)
> We did some testing on this ixgbe driver, we found it is mostly functional, but
> with some remaining bugs.
>
> 1. ethtool test failed
> 2. FC is disabled by default
> 3. no warnings for old EEPROM
> 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> supports a limited set of SFP modules)
> 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
>
> 1) 2) I think both are already patched upstream
>
1)
commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
Author: Auke Kok <auke-jan.h.kok>
Date: Mon Feb 11 09:26:01 2008 -0800
ixgbe: Disallow device reset during ethtool test
2)
commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
Author: Don Skidmore <donald.c.skidmore>
Date: Tue Mar 31 21:33:44 2009 +0000
ixgbe: feature - driver to default with FC on.
> 3) is fixed with just a quick check with a printk, it is to help us and users
> weed out pre-production adapters that might have link issues.
Can you elaborate on this a bit?
> 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> the driver to enforce this correctly.
Is there an upstream fix for this?
> 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> well. It was a community patch (generic MDIO) that broke it, we are pretty
> close to fixing it. Once the patch for 82598 SFP+ is pushed upstream, we
> will send the commit directly to RH to include it asap.
OK, we are getting *REALLY* close to the deadline here.
> Since we just got final adapter hardware (with new eeprom images and larger
> eeprom chips) a couple of weeks ago we have a few more code changes that should
> be included in RH5.4 to be complete.
I think I've got someone complaining to me about this already. Apparently probe fails when ixgbe_get_sfp_init_sequence_offsets reads IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff. Is this what you are talking about?
(In reply to comment #48)
My replies inline below:
> Jesse, please confirm this for me:
> (In reply to comment #45)
> > We did some testing on this ixgbe driver, we found it is mostly functional, but
> > with some remaining bugs.
> >
> > 1. ethtool test failed
> > 2. FC is disabled by default
> > 3. no warnings for old EEPROM
> > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> > supports a limited set of SFP modules)
> > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> >
> > 1) 2) I think both are already patched upstream
> >
> 1)
> commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
> Author: Auke Kok <auke-jan.h.kok>
> Date: Mon Feb 11 09:26:01 2008 -0800
> ixgbe: Disallow device reset during ethtool test
> 2)
> commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
> Author: Don Skidmore <donald.c.skidmore>
> Date: Tue Mar 31 21:33:44 2009 +0000
> ixgbe: feature - driver to default with FC on.
> > 3) is fixed with just a quick check with a printk, it is to help us and users
> > weed out pre-production adapters that might have link issues.
> Can you elaborate on this a bit?
Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that manages our SFI link to the SFP+ modules. In our older spins of the board, we found issues where the network link can be falsely indicated up and then down (flapping) when no cable was plugged into the NIC. To solve this, we added firmware to the EEPROM which assists in PHY maintenance.
To catch pre-production boards in the field, we added some code to read the EEPROM version, find the FW version, and if it's older than a certain rev, we display a warning in the system log.
This fix isn't critical, so if it comes down to something more critical making it in, this can be dropped.
> > 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> > the driver to enforce this correctly.
> Is there an upstream fix for this?
Yes. I can provide a patch if needed; it's the code surrounding the DEVICE_CAPS when referencing the EEPROM. It may be a bit difficult to extract it properly, since it also has FCoE goo in it.
> > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> > well. It was a community patch (generic MDIO) that broke it, we are pretty
> > close to fixing it. Once the patch for 82598 SFP+ is pushed upstream, we
> > will send the commit directly to RH to include it asap.
> OK, we are getting *REALLY* close to the deadline here.
Don just fixed this issue, and I know we have the fix locally. I will see if we can expedite the fix. As an alternative, you can roll back the MDIO changes that came from Ben Hutchings, and that will also fix this issue. Let me know what you'd prefer doing.
> > Since we just got final adapter hardware (with new eeprom images and larger
> > eeprom chips) a couple of weeks ago we have a few more code changes that should
> > be included in RH5.4 to be complete.
> I think I've got someone complaining to me about this already. Apparently
> probe fails when ixgbe_get_sfp_init_sequence_offsets reads
> IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff. Is this what you are talking about?
Wow, that is an OLD EEPROM! Yes, that's part of it. The 4 NICs that were just sent about 2 weeks ago have the final EEPROM on them, which also includes the FW I mentioned before. These boards are also physically shorter than the original boards you guys have. I would recommend using them if you can get your hands on them. We can try and get your older ones replaced as well, but I'm not sure we can replace them in the timeframe for RHEL5.4. Let me know what you'd like to do.
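The pre-production firmware check described in this comment (read the firmware version out of the EEPROM and warn if it is older than a given revision) could look roughly like the user-space sketch below. This is an illustration only, not the upstream code: the threshold MIN_FW_VERSION, the treatment of an erased word, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EEPROM_WORD_ERASED 0xFFFFu /* an unprogrammed EEPROM word reads back as all ones */
#define MIN_FW_VERSION     0x0005u /* hypothetical minimum revision, for illustration only */

/* Decide whether a warning is due; kept separate from the logging so it is easy to test. */
static bool fw_needs_warning(uint16_t fw_version)
{
    if (fw_version == EEPROM_WORD_ERASED)
        return true; /* no firmware image present at all */
    return fw_version < MIN_FW_VERSION;
}

/* Stand-in for the printk-style warning; the driver would log to the kernel log instead. */
static void warn_if_old_fw(const char *ifname, uint16_t fw_version)
{
    if (fw_needs_warning(fw_version))
        fprintf(stderr,
                "%s: firmware 0x%04x is older than 0x%04x, possibly a pre-production adapter\n",
                ifname, (unsigned)fw_version, (unsigned)MIN_FW_VERSION);
}
```

Since the check only warns and never fails the probe, dropping it (as offered above) would not change driver behavior on production boards.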
PJ, thanks for responding. In case you are not aware, the best place to grab the latest dev trees is here:
http://people.redhat.com/dzickus/
and I maintain some kernels with networking patches here:
http://people.redhat.com/agospoda/
I would suggest checking my test kernels to use as a basis for these patches.
(In reply to comment #49)
> (In reply to comment #48)
>
> My replies inline below:
>
> > Jesse, please confirm this for me:
> > (In reply to comment #45)
> > > We did some testing on this ixgbe driver, we found it is mostly functional, but
> > > with some remaining bugs.
> > >
> > > 1. ethtool test failed
> > > 2. FC is disabled by default
> > > 3. no warnings for old EEPROM
> > > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> > > supports a limited set of SFP modules)
> > > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> > >
> > > 1) 2) I think both are already patched upstream
> > >
> > 1)
> > commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
> > Author: Auke Kok <auke-jan.h.kok>
> > Date: Mon Feb 11 09:26:01 2008 -0800
> > ixgbe: Disallow device reset during ethtool test
> > 2)
> > commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
> > Author: Don Skidmore <donald.c.skidmore>
> > Date: Tue Mar 31 21:33:44 2009 +0000
> > ixgbe: feature - driver to default with FC on.
> > > 3) is fixed with just a quick check with a printk, it is to help us and users
> > > weed out pre-production adapters that might have link issues.
> > Can you elaborate on this a bit?
>
> Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that
> manages our SFI link to the SFP+ modules. In our older spins of the board, we
> found issues where the network link can be falsely indicated up and then down
> (flapping) when no cable was plugged into the NIC. To solve this, we added a
> firmware into the EEPROM which assists in PHY maintenance.
>
> To catch pre-production boards in the field, we added some code to read the
> EEPROM version, find the FW version, and if it's older than a certain rev, we
> display a warning to the system log.
>
> This fix isn't critical, so if it comes down to something more critical making
> it in, this can be dropped.
>
> > > 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> > > the driver to enforce this correctly.
> > Is there an upstream fix for this?
>
> Yes. I can provide a patch if needed; it's the code surrounding the
> DEVICE_CAPS when referencing the EEPROM. It may be a bit difficult to extract
> it properly, since it also has FCoE goo in it.
>
Anything you can do to help is great. The deadline is the end of *this* week, so we need to get this knocked out. If you can tell me the commit id for this upstream, I can take a look.
> > > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> > > well. It was a community patch (generic MDIO) that broke it, we are pretty
> > > close to fixing it. Once the patch for 82598 SFP+ is pushed upstream, we
> > > will send the commit directly to RH to include it asap.
> > OK, we are getting *REALLY* close to the deadline here.
>
> Don just fixed this issue, and I know we have the fix locally. I will see if
> we can expedite the fix. As an alternative, you can roll back the MDIO
> changes that came from Ben Hutchings, and that will also fix this issue. Let
> me know what you'd prefer doing.
>
We never took this fix:
commit 6b73e10d2d89f9ce773f9b47d61b195936d059ba
Author: Ben Hutchings <bhutchings>
Date: Wed Apr 29 08:08:58 2009 +0000
ixgbe: Use generic MDIO definitions and functions
so I don't see it as something we need to revert. :)
> > > Since we just got final adapter hardware (with new eeprom images and larger
> > > eeprom chips) a couple of weeks ago we have a few more code changes that should
> > > be included in RH5.4 to be complete.
> > I think I've got someone complaining to me about this already. Apparently
> > probe fails when ixgbe_get_sfp_init_sequence_offsets reads
> > IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff. Is this what you are talking about?
>
> Wow, that is an OLD EEPROM! Yes, that's part of it. The 4 NICs that were just
> sent about 2 weeks ago have the final EEPROM on them, which also includes the
> FW I mentioned before. These boards are also physically shorter than the
> original boards you guys have. I would recommend using them if you can get
> your hands on them. We can try and get your older ones replaced as well, but
> I'm not sure we can replace them in the timeframe for RHEL5.4. Let me know
> what you'd like to do.
That report came from someone at Stratus. Should I tell them to get some new cards?
> > > 1. ethtool test failed
I did some more looking and the upstream ixgbe driver doesn't support the ethtool self_test op, so that's not much of a concern anymore. :)
(In reply to comment #51)
> > > > 1. ethtool test failed
> I did some more looking and the upstream ixgbe driver doesn't support the
> ethtool self_test op, so that's not much of a concern anymore. :)
It isn't upstream in net-2.6. I did get it upstream in net-next-2.6, but I don't see this as a need right now.
(In reply to comment #50)
> PJ, thanks for responding. In case you are not aware, the best place to grab
> the latest dev trees is here:
> http://people.redhat.com/dzickus/
> and I maintain some kernels with networking patches here:
> http://people.redhat.com/agospoda/
> I would suggest checking my test kernels to use as a basis for these patches.
I'll get these pulled onto my local systems today.
> (In reply to comment #49)
> > (In reply to comment #48)
> >
> > My replies inline below:
> >
> > > Jesse, please confirm this for me:
> > > (In reply to comment #45)
> > > > We did some testing on this ixgbe driver, we found it is mostly functional, but
> > > > with some remaining bugs.
> > > >
> > > > 1. ethtool test failed
> > > > 2. FC is disabled by default
> > > > 3. no warnings for old EEPROM
> > > > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> > > > supports a limited set of SFP modules)
> > > > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> > > >
> > > > 1) 2) I think both are already patched upstream
> > > >
> > > 1)
> > > commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
> > > Author: Auke Kok <auke-jan.h.kok>
> > > Date: Mon Feb 11 09:26:01 2008 -0800
> > > ixgbe: Disallow device reset during ethtool test
> > > 2)
> > > commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
> > > Author: Don Skidmore <donald.c.skidmore>
> > > Date: Tue Mar 31 21:33:44 2009 +0000
> > > ixgbe: feature - driver to default with FC on.
> > > > 3) is fixed with just a quick check with a printk, it is to help us and users
> > > > weed out pre-production adapters that might have link issues.
> > > Can you elaborate on this a bit?
> >
> > Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that
> > manages our SFI link to the SFP+ modules. In our older spins of the board, we
> > found issues where the network link can be falsely indicated up and then down
> > (flapping) when no cable was plugged into the NIC. To solve this, we added a
> > firmware into the EEPROM which assists in PHY maintenance.
> >
> > To catch pre-production boards in the field, we added some code to read the
> > EEPROM version, find the FW version, and if it's older than a certain rev, we
> > display a warning to the system log.
> >
> > This fix isn't critical, so if it comes down to something more critical making
> > it in, this can be dropped.
> >
> > > > 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> > > > the driver to enforce this correctly.
> > > Is there an upstream fix for this?
> >
> > Yes. I can provide a patch if needed; it's the code surrounding the
> > DEVICE_CAPS when referencing the EEPROM. It may be a bit difficult to extract
> > it properly, since it also has FCoE goo in it.
> >
> Anything you can do to help is great. The deadline is the end of *this* week,
> so we need to get this knocked out. If you can tell me the commit id for this
> upstream, I can take a look.
I will get a concise list of commits and reply to this thread with them. I'll have them in a few hours for you.
> > > > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> > > > well. It was a community patch (generic MDIO) that broke it, we are pretty
> > > > close to fixing it. Once the patch for 82598 SFP+ is pushed upstream, we
> > > > will send the commit directly to RH to include it asap.
> > > OK, we are getting *REALLY* close to the deadline here.
> >
> > Don just fixed this issue, and I know we have the fix locally. I will see if
> > we can expedite the fix. As an alternative, you can roll back the MDIO
> > changes that came from Ben Hutchings, and that will also fix this issue. Let
> > me know what you'd prefer doing.
> >
> We never took this fix:
> commit 6b73e10d2d89f9ce773f9b47d61b195936d059ba
> Author: Ben Hutchings <bhutchings>
> Date: Wed Apr 29 08:08:58 2009 +0000
> ixgbe: Use generic MDIO definitions and functions
> so I don't see it as something we need to revert. :)
Ok. I'll need to do some digging to figure out why the 82598 SFP+ devices are broken then. There's a ton of churn in the SFP+ code for the 82599 stuff, so I'll need to find what is wrong.
> > > > Since we just got final adapter hardware (with new eeprom images and larger
> > > > eeprom chips) a couple of weeks ago we have a few more code changes that should
> > > > be included in RH5.4 to be complete.
> > > I think I've got someone complaining to me about this already. Apparently
> > > probe fails when ixgbe_get_sfp_init_sequence_offsets reads
> > > IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff. Is this what you are talking about?
> >
> > Wow, that is an OLD EEPROM! Yes, that's part of it. The 4 NICs that were just
> > sent about 2 weeks ago have the final EEPROM on them, which also includes the
> > FW I mentioned before. These boards are also physically shorter than the
> > original boards you guys have. I would recommend using them if you can get
> > your hands on them. We can try and get your older ones replaced as well, but
> > I'm not sure we can replace them in the timeframe for RHEL5.4. Let me know
> > what you'd like to do.
> That report came from someone at Stratus. Should I tell them to get some new
> cards?
Yes. That EEPROM is based on the original EEPROM for our final hardware, which is a very old EEPROM. The current drivers won't be able to load (obviously from the error), so Stratus will need to request new NICs from their Intel rep.
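To make the Stratus probe failure concrete: an EEPROM pointer word that reads back as 0xffff means the word was never programmed, so there is no SFP+ init sequence to follow. A minimal sketch of that validity check is below. It is a user-space illustration under stated assumptions, not the driver's ixgbe_get_sfp_init_sequence_offsets: the enum, the function name, and the decision to also reject 0 are hypothetical.

```c
#include <stdint.h>

#define EEPROM_WORD_ERASED 0xFFFFu /* unprogrammed EEPROM words read back as all ones */

/* Hypothetical status codes for this sketch; the driver uses its own error values. */
enum sfp_init_status {
    SFP_INIT_OK = 0,
    SFP_INIT_NO_SEQUENCE = -1
};

/*
 * Validate the word read from the PHY init offset location before using it
 * as an offset into the EEPROM. A zero or erased word cannot point at a
 * valid init sequence, so the probe has to fail rather than walk garbage.
 */
static enum sfp_init_status check_sfp_init_offset(uint16_t list_offset)
{
    if (list_offset == 0 || list_offset == EEPROM_WORD_ERASED)
        return SFP_INIT_NO_SEQUENCE;
    return SFP_INIT_OK;
}
```

On the old EEPROM image described above, the read returns 0xffff, the check fails, and probe aborts, which matches the reported symptom. No driver change fixes that; the card needs the newer EEPROM.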
(In reply to comment #53) > (In reply to comment #50) > > PJ, thanks for responding. In case you are not aware, the best place to grab > > the latest dev trees is here: > > http://people.redhat.com/dzickus/ > > and I maintain some kernels with networking patches here: > > http://people.redhat.com/agospoda/ > > I would suggest checking my test kernels to use as a basis for these patches. > I'll get these pulled onto my local systems today > > (In reply to comment #49) > > > (In reply to comment #48) > > > > > > My replies inline below: > > > > > > > Jesse, please confirm this for me: > > > > (In reply to comment #45) > > > > > We did some testing on this ixgbe driver, we found it is mostly functional, but > > > > > with some remaining bugs. > > > > > > > > > > 1. ethtool test failed > > > > > 2. FC is disabled by default > > > > > 3. no warnings for old EEPROM > > > > > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only > > > > > supports a limited set of SFP modules) > > > > > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link. > > > > > > > > > > 1) 2) I think both are already patched upstream > > > > > > > > > 1) > > > > commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9 > > > > Author: Auke Kok <auke-jan.h.kok> > > > > Date: Mon Feb 11 09:26:01 2008 -0800 > > > > ixgbe: Disallow device reset during ethtool test > > > > 2) > > > > commit cd7664f69fe1f3f75b664503ae3e11a2971a4865 > > > > Author: Don Skidmore <donald.c.skidmore> > > > > Date: Tue Mar 31 21:33:44 2009 +0000 > > > > ixgbe: feature - driver to default with FC on. > > > > > 3) is fixed with just a quick check with a printk, it is to help us and users > > > > > weed out pre-production adapters that might have link issues. > > > > Can you elaborate on this a bit? > > > > > > Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that > > > manages our SFI link to the SFP+ modules. 
In our older spins of the board, we > > > found issues where the network link can be falsely indicated up and then down > > > (flapping) when no cable was plugged into the NIC. To solve this, we added a > > > firmware into the EEPROM which assists in PHY maintenance. > > > > > > To catch pre-production boards in the field, we added some code to read the > > > EEPROM version, find the FW version, and if it's older than a certain rev, we > > > display a warning to the system log. > > > > > > This fix isn't critical, so if it comes down to something more critical making > > > it in, this can be dropped. > > > > > > > > 4) Intel will only support a limited set of SFP+ modules, code needs to be in > > > > > the driver to enforce this correctly. > > > > Is there an upstream fix for this? > > > > > > Yes. I can provide a patch if needed; it's the code surrounding the > > > DEVICE_CAPS when referencing the EEPROM. It may be a bit difficult to extract > > > it properly, since it also has FCoE goo in it. > > > > > Anything you can do to help is great. The deadline is the end of *this* week, > > so we need to get this knocked out. If you can tell me the commit id for this > > upstream, I can take a look. > I will get a concise list of commits and reply to this thread with them. I'll > have them in a few hours for you. Here are the commits from Dave Miller's net-next-2.6 tree that you should refer to: commit aa5aec888585fedcda7cfffc20f75240ad1cb42d - ixgbe: Add semaphore access for PHY initialization for 82599 commit 1479ad4fbfbc801898dce1ac2d4d44f0c774ecc5 - ixgbe: Change the 82599 PHY DSP restart logic The above two are critical. 
The next one is optional, and can be left out at your discretion: commit 794caeb259bc5d341bcc80dd37820073147a231c - ixgbe: Add FW detection and warning for 82599 SFP+ adapters I haven't been able to find the commit that adds the get_device_caps() support, which is what we use to properly identify the SFP+ modules and reject modules that aren't on our whitelist. I will keep looking, but I may just say let's go with what we have for now. If I can't find it by EOD today, let's drop that request. > > > > > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as > > > > > well. It was a community patch (generic MDIO) that broke it, we are pretty > > > > > close to fixing it. Once the patch for for 82598 SFP+ is pushed upstream, we > > > > > will send the commit directly to RH to include it asap. > > > > OK, we are are getting *REALLY* close to the deadline here. > > > > > > Don just fixed this issue, and I know we have the fix locally. I will see if > > > we can expediate the fix. As an alternative, you can roll back the MDIO > > > changes that came from Ben Hutchings, and that will also fix this issue. Let > > > me know what you'd prefer doing. > > > > > We never took this fix: > > commit 6b73e10d2d89f9ce773f9b47d61b195936d059ba > > Author: Ben Hutchings <bhutchings> > > Date: Wed Apr 29 08:08:58 2009 +0000 > > ixgbe: Use generic MDIO definitions and functions > > so I don't see it as something we need to revert. :) > Ok. I'll need to do some digging to figure out why the 82598 SFP+ devices are > broken then. There's a ton of churn in the SFP+ code for the 82599 stuff, so > I'll need to find what is wrong. > > > > > Since we just got final adapter hardware (with new eeprom images and larger > > > > > eeprom chips) a couple of weeks ago we have a few more code changes that should > > > > > be included in RH5.4 to be complete. > > > > I think I've got someone complaining to me about this already. 
Apparently > > > > probe fails when ixgbe_get_sfp_init_sequence_offsets reads > > > > IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff. Is this what you are talking about? > > > > > > Wow, that is an OLD EEPROM! Yes, that's part of it. The 4 NICs that were just > > > sent about 2 weeks ago have the final EEPROM on them, which also includes the > > > FW I mentioned before. These boards are also physically shorter than the > > > original boards you guys have. I would recommend using them if you can get > > > your hands on them. We can try and get your older ones replaced as well, but > > > I'm not sure we can replace them in the timeframe for RHEL5.4. Let me know > > > what you'd like to do. > > That report came from someone at Stratus. Should I tell them to get some new > > cards? > Yes. That EEPROM is based on the original EEPROM for our final hardware, which > is a very old EEPROM. The current drivers won't be able to load (obviously > from the error), so Stratus will need to request new NICs from their Intel rep.
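For anyone following along, the probe failure discussed above (the SFP+ init sequence offset reading back as 0xffff on an old EEPROM) can be illustrated with a small user-space sketch. This is not the ixgbe driver code. The EEPROM is modelled as a plain array of 16-bit words, and the word index, the status codes, and the function name get_sfp_init_list_offset are all hypothetical names made up for this illustration. It also assumes that 0 and 0xffff both mean "not programmed".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative placeholder word index, not taken from the driver. */
#define PHY_INIT_OFFSET_WORD 0x10u
/* Erased or unprogrammed EEPROM words typically read back as all ones. */
#define EEPROM_WORD_UNPROGRAMMED 0xFFFFu

/* Hypothetical status codes for this sketch. */
enum sfp_init_status {
    SFP_INIT_OK = 0,
    SFP_INIT_ERR_RANGE = -1,          /* image too small to hold the word */
    SFP_INIT_ERR_NOT_PROGRAMMED = -2  /* word is 0 or 0xffff */
};

/*
 * Look up the SFP+ init-sequence list offset in an EEPROM image.
 * On success the offset is stored in *list_offset. An unprogrammed word
 * is reported as an error so the caller can fail the probe cleanly
 * instead of following a bogus pointer.
 */
static int get_sfp_init_list_offset(const uint16_t *eeprom, size_t n_words,
                                    uint16_t *list_offset)
{
    uint16_t word;

    assert(eeprom != NULL && list_offset != NULL);

    if (PHY_INIT_OFFSET_WORD >= n_words)
        return SFP_INIT_ERR_RANGE;

    word = eeprom[PHY_INIT_OFFSET_WORD];
    if (word == 0 || word == EEPROM_WORD_UNPROGRAMMED)
        return SFP_INIT_ERR_NOT_PROGRAMMED;

    *list_offset = word;
    return SFP_INIT_OK;
}
```

With an all-0xffff image the helper returns SFP_INIT_ERR_NOT_PROGRAMMED, which corresponds to the "probe fails" symptom reported for the old boards.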
Update: we re-tested the report that the 82598 SFP+ devices were not linking with the RHEL5.4 driver. The symptom is that the link intermittently takes a long time to come up (up to 10 seconds), but once it's up, it's stable. Other times the link immediately comes up. However, in none of the testing does the link fail to come online. We will continue to test this scenario, but at this point we are treating this as a low priority issue, and recommend going forward with no additional driver changes for this issue. If RH wants to see a fix for this, please advise.
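If anyone wants to distinguish "link is slow" from "link never comes up" in their own testing, the check amounts to polling with a timeout longer than the worst case observed above (10 seconds). A minimal user-space sketch follows. The names wait_for_link, fake_link and fake_link_is_up are hypothetical, the link state comes from a caller-supplied callback rather than real hardware, and the delay function is injected so the logic can be exercised without actually sleeping.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*link_check_fn)(void *ctx);
typedef void (*delay_fn)(unsigned int ms);

/*
 * Poll link_is_up() every interval_ms until it reports link or until
 * timeout_ms has elapsed. Returns the number of milliseconds waited
 * before link came up, or -1 on timeout or bad arguments.
 */
static int wait_for_link(link_check_fn link_is_up, void *ctx,
                         delay_fn delay_ms, unsigned int timeout_ms,
                         unsigned int interval_ms)
{
    unsigned int waited = 0;

    if (link_is_up == NULL || interval_ms == 0)
        return -1;

    for (;;) {
        if (link_is_up(ctx))
            return (int)waited;
        if (waited >= timeout_ms)
            return -1;
        if (delay_ms != NULL)
            delay_ms(interval_ms);  /* real callers would sleep here */
        waited += interval_ms;
    }
}

/* Stand-in for hardware: reports link only after a given number of polls. */
struct fake_link {
    unsigned int polls_until_up;
    unsigned int polls;
};

static bool fake_link_is_up(void *ctx)
{
    struct fake_link *l = ctx;
    return l->polls++ >= l->polls_until_up;
}
```

A timeout of 15000 ms with a 100 ms interval would be one reasonable choice here (an assumption, chosen only to sit above the 10 second worst case). A slow link then returns a large positive wait time, while a link that is really dead returns -1.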
PJ, my test kernels have been updated with the code that we plan to ship for 5.4. Could you or someone else help verify them? We realize that, based on the list in comment #45, there are still some outstanding issues: 1. ethtool test failed but I'm not sure if these have been resolved or if we can live with these or any additional problems. 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only supports a limited set of SFP modules) 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link. Thanks for all the help on this.
(In reply to comment #57) > PJ, my test kernels have been updated with the code that we plan to ship for > 5.4. Could you or someone else help verify them? We realize that based on > list in comment #45 there are still some outstanding issues: > 1. ethtool test failed > but I'm not sure if these have been resolved or if we can live with these or > any additional problems. The ethtool test failing is fine. We've never had ethtool test support in ixgbe until very recently in Dave Miller's net-next-2.6. So not having it here is not a problem. > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only > supports a limited set of SFP modules) We can let this one go. The good thing is SFP+ modules are functional at this point; I'd be worried if supported modules weren't working. > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link. We've retested this and it's not that it won't get link, it just takes a long time to get link. It's intermittent, but link will always come up. This is a non-issue at this point. > Thanks for all the help on this. You bet. We should be in good shape from here. Your 74 kernel was tested and given the green light yesterday.
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
During the testing of the I/OAT code it was found that the 5.4 ixgbe driver does not have the end-point DCA code change in it. As such only DMA testing was possible with the included ixgbe driver. We would like to be able to test end-point DCA but can't at this point. Support has been upstream for some time now. Please advise.
(In reply to comment #62) > During the testing of the I/OAT code it was found that the 5.4 ixgbe driver > does not have the end-point DCA code change in it. As such only DMA testing > was possible with the included ixgbe driver. We would like to be able to test > end-point DCA but can't at this point. Support has been upstream for some time > now. Please advise. The ixgbe driver originally did not have any of the DCA bits, as they were upstream but not in RHEL5. This has changed in RHEL5.4 as DCA was added, but I do not anticipate support for DCA in ixgbe until RHEL5.5.
I think that some of the OEM's (from Japan, NEC, FSC and Hitachi) are looking for this. Is it the right decision? Do they know?
(In reply to comment #64) > I think that some of the OEM's (from Japan, NEC, FSC and Hitachi) are looking > for this. Is it the right decision? Do they know? I'm not sure. I guess we will find out! :-)
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~ RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching. If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Patches are in -158.el5 kernel. SanityOnly.
Because there is no Intel® 82599 10GbE controller (formerly codenamed "Niantic") in RHTS, we only checked the patches for sanity, and Jan has finished that work.
The next snapshot will be tested on Niantic NICs. So the only thing left would be to report the outcome of this testing. We'll do this after the next snap is available.
~~ Attention Partners - RHEL 5.4 Snapshot 5 Released! ~~ RHEL 5.4 Snapshot 5 is the FINAL snapshot to be released before RC. It has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular issue. Please test and report back your results here, at your earliest convenience. If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. If it is urgent, escalate the issue to your partner manager as soon as possible. There is /very/ little time left to get additional code into 5.4 before GA. Partners, after you have verified, do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
John, any updates on testing?
Work continues on the FC issues. We will know more in the next few days.
Intel, any updates?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html
John, Intel?
The FC issues still need to be worked on for 5.5. It is working upstream and in our stand-alone driver version so it's a merge thing somehow. Also, the end-point DCA support has been moved to a separate BZ, 514306 so we should be covered there.
We would like to see RH take a look at the FC problem. It has to be a merge thing where either a patch didn't get applied or one got backported incorrectly. Since this is working both upstream and in our stand-alone driver, it has to be a backport/merge issue.
John, I will make sure the FC problems are addressed for 5.5.