Bug 472547 - [RHEL5.4 FEAT] Update ixgbe to version 2.0.8-k2 and support the 82599 (Niantic) device
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
Keywords: FutureFeature, HardwareEnablement, OtherQA
Duplicates: 438520 438522 438523 475580
Depends On: 505653
Blocks: 504506 504507 504615 504669 450783 460949 483701 483784 488646 507625 511206
 
Reported: 2008-11-21 12:23 EST by John Ronciak
Modified: 2009-09-09 14:03 EDT
CC List: 29 users

Doc Type: Enhancement
Clone Of:
: 505653 (view as bug list)
Last Closed: 2009-09-02 04:13:58 EDT


Attachments
eth2.tgz (32.83 KB, application/x-gzip), 2009-06-01 14:07 EDT, Andy Gospodarek
eth2-regs.tgz (32.97 KB, application/x-gzip), 2009-06-01 14:29 EDT, Andy Gospodarek

Description John Ronciak 2008-11-21 12:23:55 EST
Description of problem:
Adding device support

This is a request from IBM to have Niantic support in 5.4.
Comment 1 Martin Wilck 2008-12-08 13:07:12 EST
FSC requests this for 5.4, too.
Comment 2 John Jarvis 2009-01-27 13:20:51 EST
*** Bug 475580 has been marked as a duplicate of this bug. ***
Comment 8 RHEL Product and Program Management 2009-02-16 10:14:38 EST
Updating PM score.
Comment 10 IBM Bug Proxy 2009-02-16 11:43:14 EST
=Comment: #0=================================================
Emily J. Ratliff <ratliff@austin.ibm.com> -
1. Feature Overview:
Feature Id:	[201288]
a. Name of Feature:	Driver update for Intel 10GB - ixgbe
b. Feature Description
Driver updates to support the Intel 10GB NICs.  The drivers are called ixgbe and ixgb.

Additional Comments:	We require that the ixgbe driver be updated to support the Intel Niantic
(Dorado) 10GB NIC.

2. Feature Details:
Sponsor:	xSeries
Architectures:
x86
x86_64

Arch Specificity: Purely Common Code
Affects Kernel Modules: Yes
Delivery Mechanism: Direct from community
Category:	Kernel
Request Type:	Driver - Update Version
d. Upstream Acceptance:	Accepted
Sponsor Priority	1
f. Severity: High
IBM Confidential:	yes
Code Contribution:	no
g. Component Version Target:	2.6.24

3. Business Case
Future option support of Intel 10GB adapter will be available on several systems and blades.  These
drivers need to be updated to support the high speed adapters.

4. Primary contact at Red Hat:
John Jarvis
jjarvis@redhat.com

5. Primary contacts at Partner:
Project Management Contact:
Monte Knutson, mknutson@us.ibm.com, 877-894-1495

Technical contact(s):
Kevin Stansell, kstansel@us.ibm.com
Chris McDermott, mcdermoc@us.ibm.com

IBM Manager:
Julio Alvarez, julioa@us.ibm.com
IBM is signed up to test and provide feedback.
*** This bug has been marked as a duplicate of 472547 ***
Comment 11 Andrius Benokraitis 2009-02-18 21:42:24 EST
Gospo - is this the BZ being used for the wholesale ixgbe driver update in 5.4?
Comment 12 Andy Gospodarek 2009-02-18 22:51:37 EST
Yep, seems like it.
Comment 13 Ronald Pacheco 2009-03-06 06:29:54 EST
*** Bug 438523 has been marked as a duplicate of this bug. ***
Comment 18 Andy Gospodarek 2009-04-22 22:40:21 EDT
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
Comment 20 John Ronciak 2009-04-27 19:13:06 EDT
The driver loads (the IDs are there) on the -70 kernel, but it only passes a few packets before it hangs.  We tried both our Spring Fountain NICs (SFP+ type NICs with direct attach cables) and the IBM Windley Key NICs (KX4).  They all did the same thing.
Comment 21 Keve Gabbert 2009-04-28 18:22:47 EDT
Has RH tested with the IBM Windley Key NICs?
Comment 22 Andy Gospodarek 2009-04-29 09:32:42 EDT
(In reply to comment #20)
> The driver loads (ID are there) on -70 kernel but it only passes a few packets
> before it hangs.  We tried both our Spring Fountain NIC's (SFP+ type NIC with
> direct attach cables and the IBM Windley Key NICs (KX4).  They all did the same
> thing.

I tested this on the lone ixgbe-based NIC that I have locally.  It's a dual port CX4 82598 and it seemed to work fine when I ran netperf on it for a while (I don't remember how long) using both msi-x and legacy interrupts without any issue.

When you see the 'hang' does the kernel hang or does the network interface just stop processing frames?  

I took another look at my backport and noticed a potential problem in the ixgbe_clean_rxonly_many function.  I would probably not have seen it, since I was using a system with fewer cores than yours and would not have had to deal with the vector overlap.  The patch looks like this:

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index ebf3578..5f271e1 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1109,7 +1109,7 @@ static int ixgbe_clean_rxonly_many(struct net_device *netdev, int *budget)
        /* If all Rx work done, exit the polling mode */
        if ((work_done < work_to_do) || !netif_running(adapter->netdev)) {
 quit_polling:
-               netif_rx_complete(adapter->netdev);
+               netif_rx_complete(netdev);
                if (adapter->itr_setting & 1)
                        ixgbe_set_itr_msix(q_vector);
                if (!test_bit(__IXGBE_DOWN, &adapter->state))

I'll apply that fix to my test kernels and get you some new ones.
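
The one-line change in the patch above can be illustrated with a toy model. This is only a sketch under an assumption the thread does not spell out: that in the multi-vector case the `netdev` argument passed to the poll routine is a per-vector polling device distinct from `adapter->netdev`. All `toy_` names are hypothetical and stand in for the real NAPI calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the NAPI "poll scheduled" state of a net_device. */
struct toy_netdev {
	bool poll_scheduled;
};

/* Toy adapter that only remembers its main netdev (hypothetical layout). */
struct toy_adapter {
	struct toy_netdev *netdev;
};

/* Stand-ins for netif_rx_schedule() / netif_rx_complete(). */
static void toy_rx_schedule(struct toy_netdev *dev)
{
	assert(dev != NULL);
	dev->poll_scheduled = true;
}

static void toy_rx_complete(struct toy_netdev *dev)
{
	assert(dev != NULL);
	dev->poll_scheduled = false;
}

/* Pre-patch behaviour: completes the adapter's main device, not the polled one. */
static void toy_poll_done_buggy(struct toy_adapter *adapter, struct toy_netdev *polled)
{
	(void)polled;
	toy_rx_complete(adapter->netdev);
}

/* Post-patch behaviour: completes the device that was actually being polled. */
static void toy_poll_done_fixed(struct toy_adapter *adapter, struct toy_netdev *polled)
{
	(void)adapter;
	toy_rx_complete(polled);
}
```

In this model, calling `toy_poll_done_buggy()` for a per-vector device leaves that device marked as scheduled, which is one plausible way an interface could stop processing frames without the kernel itself hanging.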

(In reply to comment #21)
> has RH tested with the IBM Windley Key NICs ?  

Nope, we haven't.  In the entire company I believe we have 2 ixgbe-based NICs.
Comment 23 John Ronciak 2009-04-29 11:40:09 EDT
Thanks Andy, we'll test the kernel as soon as you get it generated.

>> has RH tested with the IBM Windley Key NICs ?  
>Nope, we haven't.  In the entire company I believe we have 2 ixgbe-based NICs.  
RH was given 4 of these NICs back in March.  They are for the IBM Blade Center systems.  On the Engr call yesterday Peter M. reported that testing was under way with them.  I guess since it would need this driver that it is not actually under way yet.  There was a problem with the actual system(s) but IBM got that resolved.  So there are Niantic NICs in Westford.
Comment 25 John Jarvis 2009-04-30 09:52:12 EDT
This enhancement request was evaluated by the full Red Hat Enterprise Linux 
team for inclusion in a Red Hat Enterprise Linux minor release.   As a 
result of this evaluation, Red Hat has tentatively approved inclusion of 
this feature in the next Red Hat Enterprise Linux Update minor release.   
While it is a goal to include this enhancement in the next minor release 
of Red Hat Enterprise Linux, the enhancement is not yet committed for 
inclusion in the next minor release pending the next phase of actual 
code integration and successful Red Hat and partner testing.
Comment 26 Don Zickus 2009-05-06 13:14:57 EDT
in kernel-2.6.18-144.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 28 Andrius Benokraitis 2009-05-11 10:44:29 EDT
Gospo - what version of ixgbe has been committed to 5.4 so far?
Comment 29 Andy Gospodarek 2009-05-11 15:32:51 EDT
Andrius, it's version 2.0.8-k2.
Comment 30 John Ronciak 2009-05-11 18:17:57 EDT
Andy,

>BAT on ixgbe failed for Niantic (I tested on Spring Fountain). 
>Driver is unable to ping. Oplin passed.

The Spring Fountain NIC is Niantic on a PCIe board.  The same as the Windley Key IBM mezz cards that RH already has.  Since Oplin is passing, there is something specific to the backport of the Niantic code.  This was on the -144 kernel called out above.
Comment 31 John Ronciak 2009-05-19 19:00:54 EDT
This commit also needs to make it into the 5.4 ixgbe driver.

-----------------------------------
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of David Miller
Sent: Tuesday, May 19, 2009 2:41 PM
To: Kirsher, Jeffrey T
Cc: netdev@vger.kernel.org; Waskiewicz Jr, Peter P
Subject: Re: [net-next-2.6 PATCH 1/3] ixgbe: Add semaphore access for PHY initialization for 82599

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 19 May 2009 12:18:34 -0700

> The SFP+ NIC (device id 0x10fb) needs a semaphore to serialize
> PHY access, so our PHY init code must honor that same semaphore.
> 
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.
Comment 32 Andy Gospodarek 2009-05-20 21:19:08 EDT
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
Comment 33 Andy Gospodarek 2009-05-29 16:29:05 EDT
I got some cards, and have been testing them today with some interesting results.  It seems that arping works fine (broadcast or unicast), but I cannot ping with the device at all.  It's really odd.  I'm guessing it has something to do with the mac address initialization or something, but right now I have no idea.
Comment 34 Jesse Brandeburg 2009-05-29 17:55:11 EDT
That is pretty strange.  Have you ever grabbed the ethregs utility from us?  It is an application you can build that will dump all of the device's registers.

http://prdownloads.sf.net/e1000e/ethregs-1.4.1.tar.gz

If you could run that, we can compare against the configuration for a working kernel and see what might be misconfigured.

I agree your current issue may just be the initialization of the RAR registers.  Let's get in contact on Monday and see if we can figure out where your code ended up different from the 2.6.30 driver.
Comment 35 Jesse Brandeburg 2009-05-29 17:56:35 EDT
Bum link above; use this one instead:
http://superb-west.dl.sourceforge.net/sourceforge/e1000/ethregs-1.4.1.tar.gz
Comment 36 Andy Gospodarek 2009-05-29 17:59:51 EDT
I tried to track this down a bit more and have found something interesting.

When running in ixgbe_clean_rx_irq() here:

        i = rx_ring->next_to_clean;
        rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
        staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
        rx_buffer_info = &rx_ring->rx_buffer_info[i];


        while (staterr & IXGBE_RXD_STAT_DD) {
                u32 upper_len = 0;
                if (*work_done >= work_to_do)
                        break;
                (*work_done)++;

I find that for ARP requests and responses the staterr field is valid, but for all other types of traffic staterr looks more like a pointer than a value of 0x83 or 0x3 as one might expect.  Very interesting....
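
To make the observation above concrete, here is a small sketch of how the low status bits could be decoded. The bit values are my recollection of the upstream ixgbe_type.h definitions (DD = 0x01, EOP = 0x02, PIF = 0x80) and should be verified against the tree in use; the `toy_` names are hypothetical helpers, not driver functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, recalled from upstream ixgbe_type.h; verify before relying on them. */
#define TOY_RXD_STAT_DD  0x01u /* descriptor done */
#define TOY_RXD_STAT_EOP 0x02u /* end of packet */
#define TOY_RXD_STAT_PIF 0x80u /* passed inexact filter */

/* The same test the while loop in ixgbe_clean_rx_irq() applies. */
static bool toy_rx_desc_done(uint32_t staterr)
{
	return (staterr & TOY_RXD_STAT_DD) != 0;
}

/*
 * True only for the two values mentioned in the comment above:
 * 0x3 (DD | EOP) and 0x83 (DD | EOP | PIF). Real traffic can set
 * other status bits, so this is a debugging aid, not a validity check.
 */
static bool toy_rx_status_is_expected(uint32_t staterr)
{
	const uint32_t plain = TOY_RXD_STAT_DD | TOY_RXD_STAT_EOP;

	return staterr == plain || staterr == (plain | TOY_RXD_STAT_PIF);
}
```

Note that a pointer-like value with bit 0 set still satisfies the DD test, so the loop would go on to process such a descriptor even though the rest of the field is not a sensible status word.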
Comment 37 Andy Gospodarek 2009-06-01 12:55:30 EDT
Thanks, Jesse.  I'll download that and give it a try.

Right now, I'm rebuilding my upstream kernel on that box so I can test with 2.6.30 (which I'm sure will work fine).  I should have some results soon.
Comment 38 Andy Gospodarek 2009-06-01 14:07:15 EDT
Created attachment 346108 [details]
eth2.tgz

Jesse, here's a run of the ethregs utility from an upstream kernel (Linus's tree as of today) vs. the current 5.4 tree.
Comment 39 Andy Gospodarek 2009-06-01 14:29:59 EDT
Created attachment 346115 [details]
eth2-regs.tgz

That last attachment was incorrect.  Here is a correct one with the files needed.
Comment 40 Andy Gospodarek 2009-06-01 14:31:52 EDT
Also the MAC address of the card in use here is: 00:1B:21:37:B7:20
Comment 41 Peter Bogdanovic 2009-06-03 15:47:39 EDT
FYI, I have installed an Intel 10Gb Mezz card in a blade in Red Hat's Westford lab, ibm-hs22-01.lab.bos.redhat.com.  When I boot the 2.6.18-151.el5.gtest.72 kernel I see what I think is the same behavior Andy described in comment 33.  ARP requests show up, but I don't see replies from the other host.  Regardless, ICMP doesn't work.

I don't know if it helps you to have a second place to look at this issue but you are welcome to the ibm hs22 blade.
Comment 42 Andy Gospodarek 2009-06-03 16:29:30 EDT
That's excellent, Peter.  I've been testing something locally and I think I've found a fix after some help from Jesse B at Intel.  I'll post a patch and a link to new test kernels when I have something I like.
Comment 43 Andy Gospodarek 2009-06-04 11:31:46 EDT
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
Comment 44 Peter Bogdanovic 2009-06-04 18:49:43 EDT
I loaded the gtest.73 kernel on the hs22 in Westford.  The Intel 10GbE Mezz card with the ixgbe driver now responds to ICMP.  I will do some more testing with it tomorrow.  I've run out of time today.

Unfortunately the only other 10Gb ethernet card I have access to here in Westford is a qla8xxx card running the qlge driver from mainline. That driver is pretty nascent. I don't have much confidence in it. I wish I could test the ixgbe against something I felt better about.
Comment 45 Jesse Brandeburg 2009-06-05 16:53:56 EDT
We did some testing on this ixgbe driver and found it is mostly functional, but with some remaining bugs.

1. ethtool test failed 
2. FC is disabled by default
3. no warnings for old EEPROM
4. unsupported SFP+ detection on Niantic does not seem to work (Intel only supports a limited set of SFP modules)
5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.

1) 2) I think both are already patched upstream

3) is fixed with just a quick check with a printk, it is to help us and users weed out pre-production adapters that might have link issues.

4) Intel will only support a limited set of SFP+ modules, code needs to be in the driver to enforce this correctly.

5) The upstream kernel driver currently has a link problem with 82598 SFP+ as well.  It was a community patch (generic MDIO) that broke it, and we are pretty close to fixing it.  Once the patch for 82598 SFP+ is pushed upstream, we will send the commit directly to RH to include it ASAP.

Since we just got final adapter hardware (with new eeprom images and larger eeprom chips) a couple of weeks ago we have a few more code changes that should be included in RH5.4 to be complete.
Comment 46 Ronald Pacheco 2009-06-05 19:18:47 EDT
*** Bug 438522 has been marked as a duplicate of this bug. ***
Comment 47 Andy Gospodarek 2009-06-07 20:46:17 EDT
*** Bug 438520 has been marked as a duplicate of this bug. ***
Comment 48 Andy Gospodarek 2009-06-08 17:27:14 EDT
Jesse, please confirm this for me:

(In reply to comment #45)
> We did some testing on this ixgbe driver, we found it is mostly functional, but
> with some remaining bugs.
> 
> 1. ethtool test failed 
> 2. FC is disabled by default
> 3. no warnings for old EEPROM
> 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> supports a limited set of SFP modules)
> 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> 
> 1) 2) I think both are already patched upstream
> 

1)

commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
Author: Auke Kok <auke-jan.h.kok@intel.com>
Date:   Mon Feb 11 09:26:01 2008 -0800

    ixgbe: Disallow device reset during ethtool test

2) 

commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
Author: Don Skidmore <donald.c.skidmore@intel.com>
Date:   Tue Mar 31 21:33:44 2009 +0000

    ixgbe: feature - driver to default with FC on.


> 3) is fixed with just a quick check with a printk, it is to help us and users
> weed out pre-production adapters that might have link issues.

Can you elaborate on this a bit?
 
> 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> the driver to enforce this correctly.

Is there an upstream fix for this?

> 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> well.  It was a community patch (generic MDIO) that broke it, we are pretty
> close to fixing it. Once the patch for for 82598 SFP+ is pushed upstream, we
> will send the commit directly to RH to include it asap.  

OK, we are getting *REALLY* close to the deadline here.

> Since we just got final adapter hardware (with new eeprom images and larger
> eeprom chips) a couple of weeks ago we have a few more code changes that should
> be included in RH5.4 to be complete.  

I think I've got someone complaining to me about this already.  Apparently probe fails when ixgbe_get_sfp_init_sequence_offsets reads IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff.  Is this what you are talking about?
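
The failure described above can be sketched as a validity check on the word read from the EEPROM. This is a hedged illustration, not the driver's code: the thread only says the read returns 0xffff and probe fails. Treating 0xFFFF as an unprogrammed word follows the usual erased-EEPROM convention, and the additional zero check is my assumption. All `toy_` names and the error code are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ERR_PHY (-1) /* hypothetical error code */

/* Hypothetical EEPROM word reader: returns 0 on success. */
typedef int (*toy_eeprom_read_fn)(uint16_t offset, uint16_t *data);

/* Example reader simulating an old EEPROM with no SFP+ init list. */
static int toy_read_blank(uint16_t offset, uint16_t *data)
{
	(void)offset;
	*data = 0xFFFF;
	return 0;
}

/* Example reader simulating an EEPROM with a programmed list offset. */
static int toy_read_programmed(uint16_t offset, uint16_t *data)
{
	(void)offset;
	*data = 0x0120; /* arbitrary illustrative value */
	return 0;
}

/* Fetch the init-list offset and reject values that mean "not programmed". */
static int toy_get_sfp_list_offset(toy_eeprom_read_fn read_word,
				   uint16_t nl_offset, uint16_t *list_offset)
{
	assert(read_word != NULL && list_offset != NULL);

	if (read_word(nl_offset, list_offset) != 0)
		return TOY_ERR_PHY;
	if (*list_offset == 0x0000 || *list_offset == 0xFFFF)
		return TOY_ERR_PHY;
	return 0;
}
```

With a check of this shape, a board whose EEPROM predates the SFP+ init sequence fails early with an error rather than walking a bogus offset.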
Comment 49 PJ Waskiewicz 2009-06-08 19:10:04 EDT
(In reply to comment #48)

My replies inline below:

> Jesse, please confirm this for me:
> (In reply to comment #45)
> > We did some testing on this ixgbe driver, we found it is mostly functional, but
> > with some remaining bugs.
> > 
> > 1. ethtool test failed 
> > 2. FC is disabled by default
> > 3. no warnings for old EEPROM
> > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> > supports a limited set of SFP modules)
> > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> > 
> > 1) 2) I think both are already patched upstream
> > 
> 1)
> commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
> Author: Auke Kok <auke-jan.h.kok@intel.com>
> Date:   Mon Feb 11 09:26:01 2008 -0800
>     ixgbe: Disallow device reset during ethtool test
> 2) 
> commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
> Author: Don Skidmore <donald.c.skidmore@intel.com>
> Date:   Tue Mar 31 21:33:44 2009 +0000
>     ixgbe: feature - driver to default with FC on.
> > 3) is fixed with just a quick check with a printk, it is to help us and users
> > weed out pre-production adapters that might have link issues.
> Can you elaborate on this a bit?

Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that manages our SFI link to the SFP+ modules.  In our older spins of the board, we found issues where the network link could be falsely indicated up and then down (flapping) when no cable was plugged into the NIC.  To solve this, we added firmware to the EEPROM which assists in PHY maintenance.

To catch pre-production boards in the field, we added some code to read the EEPROM version, find the FW version, and if it's older than a certain rev, we display a warning to the system log.

This fix isn't critical, so if it comes down to something more critical making it in, this can be dropped.
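
The warning described above amounts to a version comparison. The following is a minimal sketch under stated assumptions: the thread does not give the minimum firmware revision or the EEPROM layout, so `TOY_MIN_FW_VERSION`, the 16-bit version format, and the message text are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical threshold; the real minimum revision is not given in this thread. */
#define TOY_MIN_FW_VERSION 0x0005u

/* A blank word (0xFFFF) or a revision below the threshold counts as pre-production. */
static bool toy_fw_is_preproduction(uint16_t fw_version)
{
	return fw_version == 0xFFFF || fw_version < TOY_MIN_FW_VERSION;
}

/*
 * Write a warning into buf when the firmware looks pre-production.
 * Returns the number of characters snprintf() reports, or 0 when no
 * warning is needed. In a driver this would be a printk instead.
 */
static int toy_format_fw_warning(char *buf, size_t len, uint16_t fw_version)
{
	if (!toy_fw_is_preproduction(fw_version))
		return 0;
	return snprintf(buf, len,
			"warning: pre-production adapter firmware 0x%04x detected",
			(unsigned int)fw_version);
}
```

Because the check only logs and never fails probe, dropping it (as suggested above) would not change which adapters load.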

> > 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> > the driver to enforce this correctly.
> Is there an upstream fix for this?

Yes.  I can provide a patch if needed; it's the code surrounding the DEVICE_CAPS when referencing the EEPROM.  It may be a bit difficult to extract it properly, since it also has FCoE goo in it.

> > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> > well.  It was a community patch (generic MDIO) that broke it, we are pretty
> > close to fixing it. Once the patch for for 82598 SFP+ is pushed upstream, we
> > will send the commit directly to RH to include it asap.  
> OK, we are are getting *REALLY* close to the deadline here.

Don just fixed this issue, and I know we have the fix locally.  I will see if we can expedite the fix.  As an alternative, you can roll back the MDIO changes that came from Ben Hutchings, and that will also fix this issue.  Let me know what you'd prefer doing.

> > Since we just got final adapter hardware (with new eeprom images and larger
> > eeprom chips) a couple of weeks ago we have a few more code changes that should
> > be included in RH5.4 to be complete.  
> I think I've got someone complaining to me about this already.  Apparently
> probe fails when ixgbe_get_sfp_init_sequence_offsets reads
> IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff.  Is this what you are talking about?  

Wow, that is an OLD EEPROM!  Yes, that's part of it.  The 4 NICs that were just sent about 2 weeks ago have the final EEPROM on them, which also includes the FW I mentioned before.  These boards are also physically shorter than the original boards you guys have.  I would recommend using them if you can get your hands on them.  We can try and get your older ones replaced as well, but I'm not sure we can replace them in the timeframe for RHEL5.4.  Let me know what you'd like to do.
Comment 50 Andy Gospodarek 2009-06-10 12:02:00 EDT
PJ, thanks for responding.  In case you are not aware, the best place to grab the latest dev trees is here:

http://people.redhat.com/dzickus/

and I maintain some kernels with networking patches here:

http://people.redhat.com/agospoda/

I would suggest checking my test kernels to use as a basis for these patches.

(In reply to comment #49)
> (In reply to comment #48)
> 
> My replies inline below:
> 
> > Jesse, please confirm this for me:
> > (In reply to comment #45)
> > > We did some testing on this ixgbe driver, we found it is mostly functional, but
> > > with some remaining bugs.
> > > 
> > > 1. ethtool test failed 
> > > 2. FC is disabled by default
> > > 3. no warnings for old EEPROM
> > > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> > > supports a limited set of SFP modules)
> > > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> > > 
> > > 1) 2) I think both are already patched upstream
> > > 
> > 1)
> > commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
> > Author: Auke Kok <auke-jan.h.kok@intel.com>
> > Date:   Mon Feb 11 09:26:01 2008 -0800
> >     ixgbe: Disallow device reset during ethtool test
> > 2) 
> > commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
> > Author: Don Skidmore <donald.c.skidmore@intel.com>
> > Date:   Tue Mar 31 21:33:44 2009 +0000
> >     ixgbe: feature - driver to default with FC on.
> > > 3) is fixed with just a quick check with a printk, it is to help us and users
> > > weed out pre-production adapters that might have link issues.
> > Can you elaborate on this a bit?
> 
> Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that
> manages our SFI link to the SFP+ modules.  In our older spins of the board, we
> found issues where the network link can be falsely indicated up and then down
> (flapping) when no cable was plugged into the NIC.  To solve this, we added a
> firmware into the EEPROM which assists in PHY maintenance.
> 
> To catch pre-production boards in the field, we added some code to read the
> EEPROM version, find the FW version, and if it's older than a certain rev, we
> display a warning to the system log.
> 
> This fix isn't critical, so if it comes down to something more critical making
> it in, this can be dropped.
> 
> > > 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> > > the driver to enforce this correctly.
> > Is there an upstream fix for this?
> 
> Yes.  I can provide a patch if needed; it's the code surrounding the
> DEVICE_CAPS when referencing the EEPROM.  It may be a bit difficult to extract
> it properly, since it also has FCoE goo in it.
> 

Anything you can do to help is great.  The deadline is the end of *this* week, so we need to get this knocked out.  If you can tell me the commit id for this upstream, I can take a look.

> > > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> > > well.  It was a community patch (generic MDIO) that broke it, we are pretty
> > > close to fixing it. Once the patch for for 82598 SFP+ is pushed upstream, we
> > > will send the commit directly to RH to include it asap.  
> > OK, we are are getting *REALLY* close to the deadline here.
> 
> Don just fixed this issue, and I know we have the fix locally.  I will see if
> we can expediate the fix.  As an alternative, you can roll back the MDIO
> changes that came from Ben Hutchings, and that will also fix this issue.  Let
> me know what you'd prefer doing.
> 

We never took this fix:

commit 6b73e10d2d89f9ce773f9b47d61b195936d059ba
Author: Ben Hutchings <bhutchings@solarflare.com>
Date:   Wed Apr 29 08:08:58 2009 +0000

    ixgbe: Use generic MDIO definitions and functions

so I don't see it as something we need to revert. :)

> > > Since we just got final adapter hardware (with new eeprom images and larger
> > > eeprom chips) a couple of weeks ago we have a few more code changes that should
> > > be included in RH5.4 to be complete.  
> > I think I've got someone complaining to me about this already.  Apparently
> > probe fails when ixgbe_get_sfp_init_sequence_offsets reads
> > IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff.  Is this what you are talking about?  
> 
> Wow, that is an OLD EEPROM!  Yes, that's part of it.  The 4 NICs that were just
> sent about 2 weeks ago have the final EEPROM on them, which also includes the
> FW I mentioned before.  These boards are also physically shorter than the
> original boards you guys have.  I would recommend using them if you can get
> your hands on them.  We can try and get your older ones replaced as well, but
> I'm not sure we can replace them in the timeframe for RHEL5.4.  Let me know
> what you'd like to do.  

That report came from someone at Stratus.  Should I tell them to get some new cards?
Comment 51 Andy Gospodarek 2009-06-10 12:45:54 EDT
> > > 1. ethtool test failed 

I did some more looking and the upstream ixgbe driver doesn't support the ethtool self_test op, so that's not much of a concern anymore. :)
Comment 52 PJ Waskiewicz 2009-06-10 13:12:41 EDT
(In reply to comment #51)
> > > > 1. ethtool test failed 
> I did some more looking and the upstream ixgbe driver doesn't support the
> ethtool self_test op, so that's not much of a concern anymore. :)  

It isn't upstream in net-2.6.  I did get it upstream in net-next-2.6, but I don't see this as a need right now.
Comment 53 PJ Waskiewicz 2009-06-10 13:20:17 EDT
(In reply to comment #50)
> PJ, thanks for responding.  In case you are not aware, the best place to grab
> the latest dev trees is here:
> http://people.redhat.com/dzickus/
> and I maintain some kernels with networking patches here:
> http://people.redhat.com/agospoda/
> I would suggest checking my test kernels to use as a basis for these patches.

I'll get these pulled onto my local systems today

> (In reply to comment #49)
> > (In reply to comment #48)
> > 
> > My replies inline below:
> > 
> > > Jesse, please confirm this for me:
> > > (In reply to comment #45)
> > > > We did some testing on this ixgbe driver, we found it is mostly functional, but
> > > > with some remaining bugs.
> > > > 
> > > > 1. ethtool test failed 
> > > > 2. FC is disabled by default
> > > > 3. no warnings for old EEPROM
> > > > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> > > > supports a limited set of SFP modules)
> > > > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> > > > 
> > > > 1) 2) I think both are already patched upstream
> > > > 
> > > 1)
> > > commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
> > > Author: Auke Kok <auke-jan.h.kok@intel.com>
> > > Date:   Mon Feb 11 09:26:01 2008 -0800
> > >     ixgbe: Disallow device reset during ethtool test
> > > 2) 
> > > commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
> > > Author: Don Skidmore <donald.c.skidmore@intel.com>
> > > Date:   Tue Mar 31 21:33:44 2009 +0000
> > >     ixgbe: feature - driver to default with FC on.
> > > > 3) is fixed with just a quick check with a printk, it is to help us and users
> > > > weed out pre-production adapters that might have link issues.
> > > Can you elaborate on this a bit?
> > 
> > Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that
> > manages our SFI link to the SFP+ modules.  In our older spins of the board, we
> > found issues where the network link can be falsely indicated up and then down
> > (flapping) when no cable was plugged into the NIC.  To solve this, we added a
> > firmware into the EEPROM which assists in PHY maintenance.
> > 
> > To catch pre-production boards in the field, we added some code to read the
> > EEPROM version, find the FW version, and if it's older than a certain rev, we
> > display a warning to the system log.
> > 
> > This fix isn't critical, so if it comes down to something more critical making
> > it in, this can be dropped.
> > 
> > > > 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> > > > the driver to enforce this correctly.
> > > Is there an upstream fix for this?
> > 
> > Yes.  I can provide a patch if needed; it's the code surrounding the
> > DEVICE_CAPS when referencing the EEPROM.  It may be a bit difficult to extract
> > it properly, since it also has FCoE goo in it.
> > 
> Anything you can do to help is great.  The deadline is the end of *this* week,
> so we need to get this knocked out.  If you can tell me the commit id for this
> upstream, I can take a look.

I will get a concise list of commits and reply to this thread with them.  I'll have them in a few hours for you.

> > > > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> > > > well.  It was a community patch (generic MDIO) that broke it, we are pretty
> > > > close to fixing it. Once the patch for for 82598 SFP+ is pushed upstream, we
> > > > will send the commit directly to RH to include it asap.  
> > > OK, we are are getting *REALLY* close to the deadline here.
> > 
> > Don just fixed this issue, and I know we have the fix locally.  I will see if
> > we can expediate the fix.  As an alternative, you can roll back the MDIO
> > changes that came from Ben Hutchings, and that will also fix this issue.  Let
> > me know what you'd prefer doing.
> > 
> We never took this fix:
> commit 6b73e10d2d89f9ce773f9b47d61b195936d059ba
> Author: Ben Hutchings <bhutchings@solarflare.com>
> Date:   Wed Apr 29 08:08:58 2009 +0000
>     ixgbe: Use generic MDIO definitions and functions
> so I don't see it as something we need to revert. :)

Ok.  I'll need to do some digging to figure out why the 82598 SFP+ devices are broken then.  There's a ton of churn in the SFP+ code for the 82599 stuff, so I'll need to find what is wrong.

> > > > Since we just got final adapter hardware (with new eeprom images and larger
> > > > eeprom chips) a couple of weeks ago we have a few more code changes that should
> > > > be included in RH5.4 to be complete.  
> > > I think I've got someone complaining to me about this already.  Apparently
> > > probe fails when ixgbe_get_sfp_init_sequence_offsets reads
> > > IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff.  Is this what you are talking about?  
> > 
> > Wow, that is an OLD EEPROM!  Yes, that's part of it.  The 4 NICs that were just
> > sent about 2 weeks ago have the final EEPROM on them, which also includes the
> > FW I mentioned before.  These boards are also physically shorter than the
> > original boards you guys have.  I would recommend using them if you can get
> > your hands on them.  We can try and get your older ones replaced as well, but
> > I'm not sure we can replace them in the timeframe for RHEL5.4.  Let me know
> > what you'd like to do.  
> That report came from someone at Stratus.  Should I tell them to get some new
> cards?

Yes.  That EEPROM is based on the original EEPROM for our final hardware, which is a very old EEPROM.  The current drivers won't be able to load (obviously from the error), so Stratus will need to request new NICs from their Intel rep.
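
For anyone triaging a similar probe failure, the check discussed above can be modeled roughly as follows. This is a minimal Python sketch and not the driver source: `read_word`, `pointer_addr`, and the exception name are hypothetical stand-ins, and it assumes that an unprogrammed pointer field reads back as 0xFFFF (or 0x0000), which is what the 0xffff report above suggests.

```python
class NoInitSequenceError(Exception):
    """Raised when the EEPROM holds no usable SFP+ init-sequence pointer."""


def get_sfp_init_list_offset(read_word, pointer_addr):
    """Return the init-sequence list offset stored at pointer_addr.

    read_word(addr) is a caller-supplied function returning one 16-bit
    EEPROM word (hypothetical interface). A value of 0xFFFF or 0x0000 is
    treated as "field not programmed" (assumption for illustration), which
    is the situation an old EEPROM image would produce.
    """
    offset = read_word(pointer_addr) & 0xFFFF
    if offset in (0x0000, 0xFFFF):
        raise NoInitSequenceError(
            "EEPROM word 0x%04x is blank (read 0x%04x)" % (pointer_addr, offset)
        )
    return offset
```

With an old image the lookup fails immediately, so the sketch raises rather than returning a bogus offset; that mirrors the probe failure reported here, where the only remedy is a board with the newer EEPROM.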
Comment 54 PJ Waskiewicz 2009-06-10 14:48:26 EDT
(In reply to comment #53)
> (In reply to comment #50)
> > PJ, thanks for responding.  In case you are not aware, the best place to grab
> > the latest dev trees is here:
> > http://people.redhat.com/dzickus/
> > and I maintain some kernels with networking patches here:
> > http://people.redhat.com/agospoda/
> > I would suggest checking my test kernels to use as a basis for these patches.
> I'll get these pulled onto my local systems today
> > (In reply to comment #49)
> > > (In reply to comment #48)
> > > 
> > > My replies inline below:
> > > 
> > > > Jesse, please confirm this for me:
> > > > (In reply to comment #45)
> > > > > We did some testing on this ixgbe driver, we found it is mostly functional, but
> > > > > with some remaining bugs.
> > > > > 
> > > > > 1. ethtool test failed 
> > > > > 2. FC is disabled by default
> > > > > 3. no warnings for old EEPROM
> > > > > 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> > > > > supports a limited set of SFP modules)
> > > > > 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.
> > > > > 
> > > > > 1) 2) I think both are already patched upstream
> > > > > 
> > > > 1)
> > > > commit 4bebfaa56b72c94fe4997240ee73ad1c1fcf96c9
> > > > Author: Auke Kok <auke-jan.h.kok@intel.com>
> > > > Date:   Mon Feb 11 09:26:01 2008 -0800
> > > >     ixgbe: Disallow device reset during ethtool test
> > > > 2) 
> > > > commit cd7664f69fe1f3f75b664503ae3e11a2971a4865
> > > > Author: Don Skidmore <donald.c.skidmore@intel.com>
> > > > Date:   Tue Mar 31 21:33:44 2009 +0000
> > > >     ixgbe: feature - driver to default with FC on.
> > > > > 3) is fixed with just a quick check with a printk, it is to help us and users
> > > > > weed out pre-production adapters that might have link issues.
> > > > Can you elaborate on this a bit?
> > > 
> > > Our 82599 SFP+ adapters (device 0x10fb) have an internal analog PHY that
> > > manages our SFI link to the SFP+ modules.  In our older spins of the board, we
> > > found issues where the network link can be falsely indicated up and then down
> > > (flapping) when no cable was plugged into the NIC.  To solve this, we added a
> > > firmware into the EEPROM which assists in PHY maintenance.
> > > 
> > > To catch pre-production boards in the field, we added some code to read the
> > > EEPROM version, find the FW version, and if it's older than a certain rev, we
> > > display a warning to the system log.
> > > 
> > > This fix isn't critical, so if it comes down to something more critical making
> > > it in, this can be dropped.
> > > 
> > > > > 4) Intel will only support a limited set of SFP+ modules, code needs to be in
> > > > > the driver to enforce this correctly.
> > > > Is there an upstream fix for this?
> > > 
> > > Yes.  I can provide a patch if needed; it's the code surrounding the
> > > DEVICE_CAPS when referencing the EEPROM.  It may be a bit difficult to extract
> > > it properly, since it also has FCoE goo in it.
> > > 
> > Anything you can do to help is great.  The deadline is the end of *this* week,
> > so we need to get this knocked out.  If you can tell me the commit id for this
> > upstream, I can take a look.
> I will get a concise list of commits and reply to this thread with them.  I'll
> have them in a few hours for you.

Here are the commits from Dave Miller's net-next-2.6 tree that you should refer to:

commit	aa5aec888585fedcda7cfffc20f75240ad1cb42d - ixgbe: Add semaphore access for PHY initialization for 82599

commit	1479ad4fbfbc801898dce1ac2d4d44f0c774ecc5 - ixgbe: Change the 82599 PHY DSP restart logic

The above two are critical.  The next one is optional, and can be left out at your discretion:

commit	794caeb259bc5d341bcc80dd37820073147a231c - ixgbe: Add FW detection and warning for 82599 SFP+ adapters

I haven't been able to find the commit that adds the get_device_caps() support, which is what we use to properly identify the SFP+ modules and reject modules that aren't on our whitelist.  I will keep looking, but I may just say let's go with what we have for now.  If I can't find it by EOD today, let's drop that request.
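
To make the optional FW-detection change easier to evaluate, here is a rough Python model of the logic described earlier in this thread (read the FW version from the EEPROM and log a warning if it is older than a certain revision). It is only a sketch under stated assumptions: the threshold `MIN_FW_VERSION`, the logger name, and the function name are hypothetical and are not taken from the commit.

```python
import logging

log = logging.getLogger("ixgbe-sketch")  # hypothetical logger name

MIN_FW_VERSION = 0x0005  # hypothetical threshold, for illustration only


def check_fw_version(fw_version, min_version=MIN_FW_VERSION):
    """Return True if the firmware looks recent enough, else warn.

    fw_version is the 16-bit version word already read from the EEPROM.
    0xFFFF is treated as "no firmware present" (assumption). The check is
    advisory: it only logs a warning and never blocks the device.
    """
    if fw_version == 0xFFFF or fw_version < min_version:
        log.warning(
            "pre-production adapter suspected: FW version 0x%04x is older "
            "than 0x%04x; link may flap with no cable attached",
            fw_version,
            min_version,
        )
        return False
    return True
```

Because the check only warns, leaving it out changes nothing functionally, which is why it can be dropped if time runs short.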

> > > > > 5) The upstream kernel driver currently has a link problem with 82598 SFP+ as
> > > > > well.  It was a community patch (generic MDIO) that broke it, we are pretty
> > > > > close to fixing it. Once the patch for 82598 SFP+ is pushed upstream, we
> > > > > will send the commit directly to RH to include it asap.  
> > > > > OK, we are getting *REALLY* close to the deadline here.
> > > 
> > > Don just fixed this issue, and I know we have the fix locally.  I will see if
> > > > we can expedite the fix.  As an alternative, you can roll back the MDIO
> > > changes that came from Ben Hutchings, and that will also fix this issue.  Let
> > > me know what you'd prefer doing.
> > > 
> > We never took this fix:
> > commit 6b73e10d2d89f9ce773f9b47d61b195936d059ba
> > Author: Ben Hutchings <bhutchings@solarflare.com>
> > Date:   Wed Apr 29 08:08:58 2009 +0000
> >     ixgbe: Use generic MDIO definitions and functions
> > so I don't see it as something we need to revert. :)
> Ok.  I'll need to do some digging to figure out why the 82598 SFP+ devices are
> broken then.  There's a ton of churn in the SFP+ code for the 82599 stuff, so
> I'll need to find what is wrong.
> > > > > Since we just got final adapter hardware (with new eeprom images and larger
> > > > > eeprom chips) a couple of weeks ago we have a few more code changes that should
> > > > > be included in RH5.4 to be complete.  
> > > > I think I've got someone complaining to me about this already.  Apparently
> > > > probe fails when ixgbe_get_sfp_init_sequence_offsets reads
> > > > IXGBE_PHY_INIT_OFFSET_NL and gets 0xffff.  Is this what you are talking about?  
> > > 
> > > Wow, that is an OLD EEPROM!  Yes, that's part of it.  The 4 NICs that were just
> > > sent about 2 weeks ago have the final EEPROM on them, which also includes the
> > > FW I mentioned before.  These boards are also physically shorter than the
> > > original boards you guys have.  I would recommend using them if you can get
> > > your hands on them.  We can try and get your older ones replaced as well, but
> > > I'm not sure we can replace them in the timeframe for RHEL5.4.  Let me know
> > > what you'd like to do.  
> > That report came from someone at Stratus.  Should I tell them to get some new
> > cards?
> Yes.  That EEPROM is based on the original EEPROM for our final hardware, which
> is a very old EEPROM.  The current drivers won't be able to load (obviously
> from the error), so Stratus will need to request new NICs from their Intel rep.
Comment 55 PJ Waskiewicz 2009-06-11 01:00:11 EDT
Update: we re-tested the report that the 82598 SFP+ devices were not linking with the RHEL5.4 driver.  The symptom is that the link intermittently takes a long time to come up (up to 10 seconds), but once it's up, it's stable.  Other times the link immediately comes up.  However, in none of the testing does the link fail to come online.  We will continue to test this scenario, but at this point we are treating this as a low priority issue, and recommend going forward with no additional driver changes for this issue.

If RH wants to see a fix for this, please advise.
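
For anyone who wants to quantify the delay on their own hardware, a small helper along these lines could time how long the link takes to come up. It is a sketch, not a tool we ship: the function names are hypothetical, the 15-second timeout is arbitrary, and the sysfs probe assumes the kernel exposes `/sys/class/net/<ifname>/carrier` for the interface.

```python
import time


def time_to_link(link_is_up, timeout=15.0, interval=0.1,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll link_is_up() and return seconds until it is True, or None.

    link_is_up is any zero-argument callable returning a bool. clock and
    sleep are injectable so the helper can be exercised without hardware.
    """
    start = clock()
    while True:
        if link_is_up():
            return clock() - start
        if clock() - start >= timeout:
            return None  # link never came up within the timeout
        sleep(interval)


def sysfs_carrier(ifname):
    """Build a probe that reads the interface's sysfs carrier flag."""
    def probe():
        try:
            with open("/sys/class/net/%s/carrier" % ifname) as f:
                return f.read().strip() == "1"
        except OSError:
            # Reading carrier fails while the interface is down.
            return False
    return probe
```

A typical use would be to bring the interface up and then call `time_to_link(sysfs_carrier("eth2"))`, repeating the run several times since the slow case is intermittent.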
Comment 56 Andy Gospodarek 2009-06-11 17:02:44 EDT
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
Comment 57 Andy Gospodarek 2009-06-12 13:56:04 EDT
PJ, my test kernels have been updated with the code that we plan to ship for 5.4.  Could you or someone else help verify them?  We realize that based on the list in comment #45 there are still some outstanding issues:

1. ethtool test failed 

but I'm not sure if these have been resolved or if we can live with these or any additional problems.

4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
supports a limited set of SFP modules)
5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.

Thanks for all the help on this.
Comment 58 PJ Waskiewicz 2009-06-12 14:36:10 EDT
(In reply to comment #57)
> PJ, my test kernels have been updated with the code that we plan to ship for
> 5.4.  Could you or someone else help verify them?  We realize that based on
> the list in comment #45 there are still some outstanding issues:
> 1. ethtool test failed 
> but I'm not sure if these have been resolved or if we can live with these or
> any additional problems.

The ethtool test failing is fine.  We've never had ethtool test support in ixgbe until very recently in Dave Miller's net-next-2.6.  So not having it here is not a problem.

> 4. unsupported SFP+ detection on Niantic does not seem to work (Intel only
> supports a limited set of SFP modules)

We can let this one go.  The good thing is SFP+ modules are functional at this point; I'd be worried if supported modules weren't working.

> 5. 82598 LOM (aka SFP+ 82598 LOM - also the Sun Mezz adapter) doesn't get link.

We've retested this and it's not that it won't get link, it just takes a long time to get link.  It's intermittent though, but link will always come up.  This is a non-issue at this point.

> Thanks for all the help on this.  

You bet.  We should be in good shape from here.  Your 74 kernel was tested and given the green light yesterday.
Comment 61 Chris Ward 2009-07-03 14:13:55 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 62 John Ronciak 2009-07-08 04:59:47 EDT
During the testing of the I/OAT code it was found that the 5.4 ixgbe driver does not have the end-point DCA code change in it.  As such only DMA testing was possible with the included ixgbe driver.  We would like to be able to test end-point DCA but can't at this point.  Support has been upstream for some time now.  Please advise.
Comment 63 Andy Gospodarek 2009-07-08 08:23:43 EDT
(In reply to comment #62)
> During the testing of the I/OAT code it was found that the 5.4 ixgbe driver
> does not have the end-point DCA code change in it.  As such only DMA testing
> was possible with the included ixgbe driver.  We would like to be able to test
> end-point DCA but can't at this point.  Support has been upstream for some time
> now.  Please advise.  

The ixgbe driver originally did not include any of the DCA bits, as they were upstream but not in RHEL5.  This has changed in RHEL5.4 as DCA was added, but I do not anticipate support for DCA in ixgbe until RHEL5.5.
Comment 64 John Ronciak 2009-07-08 08:38:08 EDT
I think that some of the OEM's (from Japan, NEC, FSC and Hitachi) are looking for this.  Is it the right decision?  Do they know?
Comment 65 Andy Gospodarek 2009-07-08 09:07:27 EDT
(In reply to comment #64)
> I think that some of the OEM's (from Japan, NEC, FSC and Hitachi) are looking
> for this.  Is it the right decision?  Do they know?  

I'm not sure.  I guess we will find out! :-)
Comment 66 Chris Ward 2009-07-10 15:06:36 EDT
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~

RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching.

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Comment 69 Jan Tluka 2009-07-20 11:00:25 EDT
Patches are in -158.el5 kernel. SanityOnly.
Comment 75 IBM Bug Proxy 2009-07-23 09:03:08 EDT
Since there is no Intel 82599 10GbE controller (formerly codenamed "Niantic")
in RHTS, we only checked the patches for sanity, and Jan has finished that work.
Comment 76 John Ronciak 2009-07-23 11:19:10 EDT
The next snapshot will be tested on Niantic NICs.  So the only thing left would be to report the outcome of this testing.  We'll do this after the next snap is available.
Comment 77 Chris Ward 2009-08-03 11:44:47 EDT
~~ Attention Partners - RHEL 5.4 Snapshot 5 Released! ~~

RHEL 5.4 Snapshot 5 is the FINAL snapshot to be released before RC. It has been 
released on partners.redhat.com. If you have already reported your test results, 
you can safely ignore this request. Otherwise, please notice that there should be 
a fix available now that addresses this particular issue. Please test and report 
back your results here, at your earliest convenience.

If you encounter any issues while testing Beta, please describe the 
issues you have encountered and set the bug into NEED_INFO. If you 
encounter new issues, please clone this bug to open a new issue and 
request it be reviewed for inclusion in RHEL 5.4 or a later update, if it 
is not of urgent severity. If it is urgent, escalate the issue to your partner manager as soon as possible. There is /very/ little time left to get additional code into 5.4 before GA.

Partners, after you have verified, do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other 
appropriate customer representative.
Comment 78 Chris Ward 2009-08-10 06:03:28 EDT
John, any updates on testing?
Comment 79 John Ronciak 2009-08-10 11:44:47 EDT
Work continues on the FC issues.  We will know more in the next few days.
Comment 81 Chris Ward 2009-09-01 09:11:19 EDT
Intel, any updates?
Comment 82 errata-xmlrpc 2009-09-02 04:13:58 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
Comment 83 Chris Ward 2009-09-09 07:15:29 EDT
John, Intel?
Comment 84 John Ronciak 2009-09-09 11:28:26 EDT
The FC issues still need to be worked on for 5.5.  It is working upstream and in our stand-alone driver version so it's a merge thing somehow.

Also, the end-point DCA support has been moved to a separate BZ, 514306 so we should be covered there.
Comment 85 John Ronciak 2009-09-09 13:55:47 EDT
We would like to see RH take a look at the FC problem.  It has to be a merge thing where either a patch didn't get applied or one got backported incorrectly.  Since this is working both upstream and in our stand-alone driver, it has to be a backport/merge issue.
Comment 86 Andy Gospodarek 2009-09-09 14:03:05 EDT
John, I will make sure the FC problems are addressed for 5.5.
