Bug 654948

Summary: RHEL5.6: 10Gb network cards (AD144 & AD385) are missing during installation and cannot be driven in the installed system
Product: Red Hat Enterprise Linux 5
Reporter: duxuewen <xue-wen.du>
Component: kernel
Assignee: bob picco <bpicco>
Status: CLOSED ERRATA
QA Contact: Network QE <network-qe>
Severity: urgent
Docs Contact:
Priority: high
Version: 5.6
CC: adam.vinsh, arozansk, dawei.pang, hjia, jiayin.shao, joseph.szczypek, joshua.powers, kzhang, li.zhang6, mschmidt, myamazak, ohudlick, shawn.pagan, shengliang.lv, shi.ze, tcamuso
Target Milestone: rc
Keywords: OtherQA, Regression
Target Release: ---   
Hardware: ia64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2011-01-13 22:01:45 UTC
Bug Blocks: 502912    
Attachments:
  Description                                   Flags
  patch for fixing the s2io initialize          none
  sizes of structs on x86_64 (pahole s2io.o)    none

Description duxuewen 2010-11-19 06:17:18 UTC
Description of problem:
During installation, stage 1 can find and use the AD144 (AD385) to install RHEL5.6, but in stage 2 the AD144 (AD385) is missing when all network cards are configured.
After logging in to the system, the 10Gb card is visible in lspci but cannot be driven.


Version-Release number of selected component (if applicable):
rx2660    AD144A: in slot (1)  (efi_2.0.4.6)
          AD337A: in slot (2)
          AD338A: in slot (3)
[rx2660-12] MP:CM> sysrev
SYSREV
Current firmware revisions
 MP FW     : F.02.25
 BMC FW    : 05.26
 EFI FW    : ROM A 07.14, ROM B 07.14
 System FW : ROM A 04.15, ROM B 04.11, Boot ROM A
 PDH FW    : 50.07
 UCIO FW   : 03.0b
 PRS FW    : 00.08 UpSeqRev: 02, DownSeqRev: 01


How reproducible:

Steps to Reproduce:
1. Install an AD144A or AD385A in an rx2660 (any Integrity server).
2. Install RHEL5.6 b1 or s1.
3. Log in to the system:
[root@mincm ~]# cd /etc/sysconfig/network-scripts/
[root@mincm network-scripts]# ls ifcfg-*
  
Actual results:
The AD144A or AD385A is missing in stage 2.
The card cannot be driven in the installed system.

Expected results:
The AD144A or AD385A works normally.

Additional info:
After logging in to the system:
[root@max ~]# sutl nics
eth0 (AD337A) 00:1a:4b:f3:05:cc <p0> e1000e [1000Mb/s]
eth1 (AD337A) 00:1a:4b:f3:05:cd <p1> e1000e [1000Mb/s]
eth2 (AD338A) 00:1a:4b:f3:58:9a <p0> e1000e [Unknown!]
eth3 (AD338A) 00:1a:4b:f3:58:9b <p1> e1000e [Unknown!]
eth4 (rx2660) 00:17:a4:99:1d:0f <p0> tg3    [1000Mb/s]
eth5 (rx2660) 00:17:a4:99:1d:0e <p1> tg3    [1000Mb/s]
[root@max ~]# sutl cards
Unrecognized PCI Devices (3):
  Unknown location:
    103c:403b-0000:0000 PCI bridge: HP PCIe Root Port (pcieport-driver)

Recognized PCI Devices (16):
  Unknown location:
    MP [HP Management Processor]
    RUSA-Serial [Ruby/Sapphire Unified Core I/O board] (serial)
    RUSA-USB [Ruby/Sapphire Unified Core I/O board] (ohci_hcd)
    RUSA-USB2 [Ruby/Sapphire Unified Core I/O board] (ehci_hcd)
    Merlion-VGA [Embedded I/O subsystems for Merlion]
    Merlion-Sputnik [Embedded I/O subsystems for Merlion] (mptsas)
    rx2660 [Embedded I/O for rx2660] (tg3)
    AD337A [HP PCIe 2-port 1000Base-T Card] (e1000e)
    AD397A/AD348A#008 (Spawn 1/2) [Smart Array P400 SAS RAID PCI-e] (cciss)
    AD144A [S2io Xframe 10Gig-E PCI-X]
    AD338A [HP PCIe 2-port 1000Base-SX Card ] (e1000e)
[root@max ~]# modprobe -r s2io
[root@max ~]# modprobe s2io
alloc_dev: Private data too big.


dmesg shows these messages:
s2io: s2io_init_nic: Using 64bit DMA
alloc_dev: Private data too big.
s2io: Device allocation failed

Comment 1 Andy Gospodarek 2010-11-19 20:30:22 UTC
Did this work on RHEL5.5?

It looks like a change to the size of the s2io_nic structure might be causing this, so I'll have to look at what changed in RHEL5.4 and RHEL5.5 as it doesn't look like much changed in RHEL5.5 that would cause this.

Comment 2 duxuewen 2010-11-22 06:51:11 UTC
(In reply to comment #1)
> Did this work on RHEL5.5?
> It looks like a change to the size of the s2io_nic structure might be causing
> this, so I'll have to look at what changed in RHEL5.4 and RHEL5.5 as it doesn't
> look like much changed in RHEL5.5 that would cause this.


Hi Andy, I have installed RHEL5.5 on the rx2660:
rx2660    AD144A: in slot (1)  (efi_2.0.4.6)
          AE311A: in slot (2)
          AD338A: in slot (3)

The AD144A can be driven and works normally.

[root@max ~]# sutl nics
eth0 (AD144A) 00:0c:fc:00:2f:4a <p0> S2IO   [10000Mb/s]
eth1 (AD337A) 00:1a:4b:f3:05:cc <p0> e1000e [Unknown!]
eth2 (AD337A) 00:1a:4b:f3:05:cd <p1> e1000e [Unknown!]
eth3 (rx2660) 00:17:a4:99:1d:0f <p0> tg3    [1000Mb/s]
eth4 (rx2660) 00:17:a4:99:1d:0e <p1> tg3    [1000Mb/s]

Comment 3 Jiayin Shao 2010-11-22 08:59:00 UTC
hi, 
This is Jiayin of the China QA team, which maintains the Red Hat defect list. I added myself to this issue's CC list, but after I updated it I saw the information below (my email is in the Excluding list), and I still do not receive updates for this defect. Could anyone tell me why?

Changes submitted for bug 654948
Email sent to:
bugsfx, agospoda, shengliang.lv, xue-wen.du, adam.vinsh, joseph.szczypek, arozansk, dag, shawn.pagan, bugbot.org, li.zhang6, shi.ze 
Excluding:
submit.redhat.com, kernel-qe, jiayin.shao 


Jiayin

Comment 4 Jiayin Shao 2010-11-22 15:02:42 UTC
hi, 
This is Jiayin of the China QA team, which maintains the Red Hat defect list. I added myself to this issue's CC list, but after I updated it I saw the information below (my email is in the Excluding list), and I still do not receive updates for this defect. Could anyone tell me why?

Changes submitted for bug 654948
Email sent to:
bugsfx, agospoda, shengliang.lv, xue-wen.du, adam.vinsh, joseph.szczypek, arozansk, dag, shawn.pagan, bugbot.org, li.zhang6, shi.ze 
Excluding:
submit.redhat.com, kernel-qe, jiayin.shao 


Jiayin

Comment 5 Andy Gospodarek 2010-11-23 22:18:17 UTC
(In reply to comment #4)
> hi, 
> This is Jiayin of China QA team who maintain Redhat defect list. I added me to
> this issue's cc list, but after I updated it, I found below information (my
> email is in Excluding list), and I still can't receive the updating for this
> defect. I wonder why it is. Anyone could tell me?
> 
> Changes submitted for bug 654948
> Email sent to:
> bugsfx, agospoda, shengliang.lv, xue-wen.du,
> adam.vinsh, joseph.szczypek, arozansk, dag,
> shawn.pagan, bugbot.org, li.zhang6,
> shi.ze 
> Excluding:
> submit.redhat.com, kernel-qe, jiayin.shao 
> 
> 
> Jiayin

If you are the one who makes an update to a bug, you will not get an email about that update; you will be on the 'Excluding' list instead.

Comment 6 Jiayin Shao 2010-11-24 05:18:24 UTC
(In reply to comment #5)
> If you make the update to the bugzilla, you will not get an email about the
> update to the bugzilla and will be on the 'excluding' list.

Got it. thanks!

Comment 7 Dawei Pang 2010-11-30 06:07:04 UTC
Created attachment 463650 [details]
patch for fixing the s2io initialize

When s2io initializes (s2io_init_nic), I found that sizeof(struct s2io_nic) = 73344, which is larger than NETDEV_PRIV_LEN_MAX, 0x0000FFFF (64K), the limit checked in alloc_netdev. As a result, "Private data too big" is reported and device allocation fails.

NETDEV_PRIV_LEN_MAX and the comparison were added by the patch linux-2.6-net-qla3xxx-fix-oops-on-too-long-netdev-priv-structure.patch.

As a workaround I removed some of the related code, and with that change s2io works.
The attachment contains the file s2io_fix_init.patch; I hope it helps us fix this issue.

By the way, we need to take care that the changes do not cause another issue.

Thanks,
Dawei
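
The size check described in this comment can be sketched roughly as follows. This is only an illustration, not the RHEL source: priv_len_ok, priv_len_demo and PRIV_LEN_MAX are hypothetical names, and only the limit value (0x0000FFFF), the reported size (73344) and the message text come from this bug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Limit quoted in this bug for NETDEV_PRIV_LEN_MAX. */
#define PRIV_LEN_MAX 0x0000FFFFu

/*
 * Hypothetical stand-in for the size check described above; not the
 * RHEL source. Returns 1 if the private area is accepted, 0 if not.
 */
int priv_len_ok(size_t sizeof_priv)
{
	if (sizeof_priv > PRIV_LEN_MAX) {
		fprintf(stderr, "alloc_dev: Private data too big.\n");
		return 0;
	}
	return 1;
}

/* Values from this bug: 73344 is rejected, 65535 is still accepted. */
void priv_len_demo(void)
{
	assert(!priv_len_ok(73344));
	assert(priv_len_ok(65535));
}
```

With a check of this shape, any driver whose private structure grows past 65535 bytes fails at device allocation, which would match the s2io messages in the description.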

Comment 9 Andy Gospodarek 2010-12-01 16:01:48 UTC
(In reply to comment #7)
> Created attachment 463650 [details]
> patch for fixing the s2io initialize
> 
> When the s2io initialize(s2io_init_nic), I found the sizeof(struct s2io_nic) =
> 73344, it is larger than NETDEV_PRIV_LEN_MAX 0X0000FFFF(64K) which compared in
> the alloc_netdev, so "Private data too big" is reported and device allocation
> failed.
> 
> NETDEV_PRIV_LEN_MAX and compared section are added by patch:
> linux-2.6-net-qla3xxx-fix-oops-on-too-long-netdev-priv-structure.patch
> 
> I remove some related codes for workaround, the s2io can work.
> There is the file s2io_fix_init.patch in the attachment, hope it can help us
> fix this issue.
> 
> By the way we need take care if the changes will cause another issue.
> 
> Thanks,
> Dawei

The attached patch will have other side effects, so we cannot use it.

What will need to happen is to put the s2io_nic structure on a diet and convert some of the data stored in the structure to pointers to allocated memory.

The best thing will be to load a system with crash and look at what elements are taking up the most space and can be moved around.

Reassigning to Bob as he should be able to quickly knock this out.
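
A minimal sketch of the "diet" idea, assuming the large embedded array is replaced by a pointer to separately allocated memory. All names here (ring_sketch, nic_fat, nic_slim, nic_slim_init, nic_slim_destroy) are hypothetical and user-space calloc stands in for kernel allocation; this is not the patch that was eventually applied.

```c
#include <assert.h>
#include <stdlib.h>

#define RING_COUNT 8

/* Placeholder ring: 8064 bytes, the ring_info size reported in this bug. */
struct ring_sketch {
	unsigned char payload[8064];
};

/* Before: rings are embedded, so the private struct itself exceeds 64K.
 * other_state stands in for the remaining 8832 bytes (73344 - 64512). */
struct nic_fat {
	unsigned char other_state[8832];
	struct ring_sketch rings[RING_COUNT];
};

/* After: only a pointer lives in the private struct. */
struct nic_slim {
	unsigned char other_state[8832];
	struct ring_sketch *rings;
};

/* Allocate the rings separately; returns 0 on success, -1 on failure. */
int nic_slim_init(struct nic_slim *nic)
{
	assert(nic != NULL);
	nic->rings = calloc(RING_COUNT, sizeof(*nic->rings));
	return nic->rings ? 0 : -1;
}

/* Release the separately allocated rings. */
void nic_slim_destroy(struct nic_slim *nic)
{
	free(nic->rings);
	nic->rings = NULL;
}
```

The trade-off is an extra allocation and an extra pointer dereference on ring access in exchange for a private structure that stays well under the limit.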

Comment 10 Michal Schmidt 2010-12-02 12:31:24 UTC
Created attachment 464238 [details]
sizes of structs on x86_64 (pahole s2io.o)

pahole is a nice tool to explore sizes of structures. Attached is the full output of "pahole s2io.o" on x86_64.

The biggest members are:

struct s2io_nic {
...
	struct mac_info            mac_control;          /*  size: 65920 */
...
/* size: 73344 */
};

struct mac_info {
...
	struct ring_info           rings[8];             /*  size: 64512 */
...
/* size: 65920 */
};

struct ring_info {
...
	struct lro                 lro0_n[32];           /*  size: 4096 */
...
	struct rx_block_info       rx_blocks[150];       /*  size: 3600 */
...
/* size: 8064 */
};
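
The arithmetic behind these numbers can be cross-checked with a small sketch. The sizes are copied from the pahole output above and the limit from comment 7; the function and constant names are made up for illustration, and padding effects are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes in bytes, copied from the pahole output above (x86_64). */
enum {
	S2IO_NIC_SIZE  = 73344,
	MAC_INFO_SIZE  = 65920,
	RINGS_SIZE     = 64512,  /* rings[8] */
	RING_INFO_SIZE = 8064,
	LRO_ARRAY_SIZE = 4096,   /* lro0_n[32] */
	RX_BLOCKS_SIZE = 3600,   /* rx_blocks[150] */
	PRIV_LEN_MAX   = 0xFFFF  /* limit quoted in comment 7 */
};

/* How far struct s2io_nic is over the limit. */
size_t bytes_over_limit(void)
{
	return S2IO_NIC_SIZE - PRIV_LEN_MAX;
}

/* Size left if rings[] were moved out entirely (padding ignored). */
size_t size_without_rings(void)
{
	return S2IO_NIC_SIZE - RINGS_SIZE;
}

/* Cross-check that the reported numbers are consistent with each other. */
void check_pahole_numbers(void)
{
	assert(8 * RING_INFO_SIZE == RINGS_SIZE);
	assert(LRO_ARRAY_SIZE / 32 == 128);  /* bytes per struct lro */
	assert(RX_BLOCKS_SIZE / 150 == 24);  /* bytes per rx_block_info */
	assert(MAC_INFO_SIZE - RINGS_SIZE == 1408);
}
```

In other words, the structure is 7809 bytes over the limit, and the rings[8] array alone accounts for 64512 of the 73344 bytes, so moving it out would leave roughly 8832 bytes.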

Comment 13 bob picco 2010-12-07 14:09:47 UTC
Please test the kernel rpm at http://people.redhat.com/~bpicco/.bz654948/
We've been unable to find working local hardware.

thanx,

bob

Comment 14 Tony Camuso 2010-12-07 19:47:23 UTC
Adding Shawn Pagan of hp in the CC list.

Shawn, does your group have hardware that can test this kernel?

Comment 16 Dawei Pang 2010-12-08 07:36:30 UTC
I downloaded the kernel from http://people.redhat.com/~bpicco/.bz654948/kernel-2.6.18-235.el5.s2iov3.ia64.rpm, installed it on RHEL5.6 S3 (IA64, rx2660), and rebooted.

The Ethernet port of the AD144 or AD385 is detected and can be pinged successfully.

By the way, which snapshot is planned to include this fix? Once it is available, I will run some stress tests on this driver.

The following is some information cut from dmesg:
-----------------------cut from dmesg--------------------
GSI 52 (level, low) -> CPU 2 (0x0200) vector 69
ACPI: PCI Interrupt 0000:06:01.0[A] -> GSI 52 (level, low) -> IRQ 69
s2io: s2io_init_nic: Using 64bit DMA
s2io: eth%d: Ring Mem PHY: 0x100ec220000
s2io: s2io_reset: Resetting XFrame card eth%d
PM: Writing back config space on device 0000:06:01.0 at offset 1 (was 2300142, writing 2300146)
s2io: Copyright(c) 2002-2007 Neterion Inc.
s2io: eth2: Neterion HP PCI-X 266MHz 10GbE SR Fiber Adapter   (rev 2)
s2io: eth2: Driver version 2.0.26.25
s2io: eth2: MAC Address: 00:0c:fc:00:58:23
s2io: Serial number: SXT0808109
s2io: eth2: Device is on 64 bit 133MHz PCIX(M1) bus
s2io: eth2: 1-Buffer receive mode enabled
s2io: eth2: NAPI enabled
s2io: eth2: Using 1 Tx fifo(s)
s2io: eth2: Using 1 Rx ring(s)
s2io: eth2: Interrupt type INTA
s2io: eth2: Multiqueue support disabled
s2io: eth2: No steering enabled for transmit
s2io: Fifo partition at: 0xc000080680101108 is: 0xfff00000000
s2io: eth2: Next block at: e0000100ec848000
s2io: eth2: Next block at: e0000100ec84c000
s2io: eth2: Next block at: e0000100ecf48000
s2io: eth2: Next block at: e0000100ecf4c000
s2io: eth2: Next block at: e0000100ebf80000
s2io: eth2: Next block at: e0000100ebf84000
s2io: eth2: Next block at: e0000100ec348000
s2io: eth2: Next block at: e0000100ec34c000
s2io: Buf in ring:0 is 3810:
s2io: eth2: Link Up
s2io: eth2: In Neterion Tx routine
s2io: eth2: Next block at: e0000100eb8dc000
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: Next block at: e0000100ec848000
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: Next block at: e0000100ec84c000
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: Next block at: e0000100ecf48000
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: In Neterion Tx routine
s2io: eth2: Next block at: e0000100ecf4c000
s2io: eth2: Next block at: e0000100ebf80000
s2io: eth2: In Neterion Tx routine
s2io: eth2: Next block at: e0000100ebf84000
s2io: eth2: Next block at: e0000100ec348000
------------------------End----------------------------


Thanks,
Dawei

Comment 17 RHEL Program Management 2010-12-08 12:22:24 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 RHEL Program Management 2010-12-08 12:24:30 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 24 Jarod Wilson 2010-12-14 14:28:15 UTC
Included in kernel-2.6.18-237.el5.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 26 Dawei Pang 2010-12-15 14:03:49 UTC
The s2io driver in kernel-2.6.18-237.el5.ia64.rpm works.
The Ethernet port of the AD144 or AD385 is detected and can be pinged successfully.
I will run the 24-hour network stress test on this kernel and report the result tomorrow.

The following is some information cut from dmesg:
-----------------------cut from dmesg after insmod s2io.ko---------------------
GSI 38 (level, low) -> CPU 0 (0x0000) vector 64
ACPI: PCI Interrupt 0000:0a:01.0[A] -> GSI 38 (level, low) -> IRQ 64
PM: Writing back config space on device 0000:0a:01.0 at offset c (was 0, writing a0000000)
PM: Writing back config space on device 0000:0a:01.0 at offset 7 (was 0, writing 802)
PM: Writing back config space on device 0000:0a:01.0 at offset 6 (was c, writing 8000000c)
PM: Writing back config space on device 0000:0a:01.0 at offset 5 (was 0, writing 802)
PM: Writing back config space on device 0000:0a:01.0 at offset 4 (was c, writing 8010000c)
PM: Writing back config space on device 0000:0a:01.0 at offset 3 (was 4000, writing 4020)
PM: Writing back config space on device 0000:0a:01.0 at offset 1 (was 2300000, writing 2300146)
PM: Writing back config space on device 0000:0a:01.0 at offset c (was 0, writing a0000000)
PM: Writing back config space on device 0000:0a:01.0 at offset 7 (was 0, writing 802)
PM: Writing back config space on device 0000:0a:01.0 at offset 6 (was c, writing 8000000c)
PM: Writing back config space on device 0000:0a:01.0 at offset 5 (was 0, writing 802)
PM: Writing back config space on device 0000:0a:01.0 at offset 4 (was c, writing 8010000c)
PM: Writing back config space on device 0000:0a:01.0 at offset 3 (was 4000, writing 4020)
PM: Writing back config space on device 0000:0a:01.0 at offset 1 (was 2300000, writing 2300146)
s2io: Copyright(c) 2002-2007 Neterion Inc.
s2io: eth2: Neterion HP PCI-X 133MHz 10GbE SR Fiber Adapter   (rev 4)
s2io: eth2: Driver version 2.0.26.25
s2io: eth2: MAC Address: 00:0c:fc:00:2f:4a
s2io: Serial number: SXT0710158
s2io: eth2: 1-Buffer receive mode enabled
s2io: eth2: NAPI enabled
s2io: eth2: Using 1 Tx fifo(s)
s2io: eth2: Using 1 Rx ring(s)
s2io: eth2: Interrupt type INTA
s2io: eth2: Multiqueue support disabled
s2io: eth2: No steering enabled for transmit
GSI 67 (level, low) -> CPU 1 (0x0200) vector 65
ACPI: PCI Interrupt 0000:4a:01.0[A] -> GSI 67 (level, low) -> IRQ 65
PM: Writing back config space on device 0000:4a:01.0 at offset 1 (was 2300142, writing 2300146)
s2io: eth2: Link Up
s2io: Copyright(c) 2002-2007 Neterion Inc.
s2io: eth0: Neterion HP PCI-X 266MHz 10GbE SR Fiber Adapter   (rev 2)
s2io: eth0: Driver version 2.0.26.25
s2io: eth0: MAC Address: 00:0c:fc:00:4d:ca
s2io: Serial number: SXT0740103
s2io: eth0: Device is on 64 bit 133MHz PCIX(M1) bus
s2io: eth0: 1-Buffer receive mode enabled
s2io: eth0: NAPI enabled
s2io: eth0: Using 1 Tx fifo(s)
s2io: eth0: Using 1 Rx ring(s)
s2io: eth0: Interrupt type INTA
s2io: eth0: Multiqueue support disabled
s2io: eth0: No steering enabled for transmit

----------------------------end-------------------------------------------------

Comment 27 Jiayin Shao 2010-12-17 03:44:38 UTC
(In reply to comment #18)
> This request was evaluated by Red Hat Product Management for inclusion in a Red
> Hat Enterprise Linux maintenance release.  Product Management has requested
> further review of this request by Red Hat Engineering, for potential
> inclusion in a Red Hat Enterprise Linux Update release for currently deployed
> products.  This request is not yet committed for inclusion in an Update
> release.

This issue is a critical problem for HP and may affect RHEL5.6's LR. Since snapshot 5 is the last snapshot, I would like to know which maintenance release will resolve it.

Jiayin

Comment 28 Dawei Pang 2010-12-17 06:46:19 UTC
I ran the 24-hour network stress test with kernel-2.6.18-237.el5.ia64.rpm, and it passed.

Comment 29 Jiayin Shao 2010-12-20 06:20:21 UTC
HP strongly hopes you can resolve this issue before the RC release.
Jiayin

Comment 33 errata-xmlrpc 2011-01-13 22:01:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html