Bug 470625

Summary: Netdump not functioning w/ bnx2 >= v1.8h (Broadcom Netxtreme II Network Card)
Product: Red Hat Enterprise Linux 5 Reporter: Joe T <jtorrice>
Component: kernel Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 5.3 CC: agospoda, andriusb, cward, jho, mchan, mgahagan, nhorman, qcai, syeghiay
Target Milestone: rc Keywords: OtherQA, Reopened
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: bnx2 v1.8.1g Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-20 20:02:58 UTC Type: ---
Bug Blocks: 440221    

Description Joe T 2008-11-08 01:54:42 UTC
Description of problem:
Netdump not functioning w/ bnx2 >= v1.8h (Broadcom Netxtreme II Network Card): the system under test (where Netdump is running) does not send a memory dump to a remote Netdump-server prior to rebooting.

Our bnx2 developers have looked into the issue. Their conclusion at this time is that the issue is a Netdump bug or limitation.


Version-Release number of selected component (if applicable):
Netdump 0.7.16-14 (RHEL 4.7 Gold)

How reproducible:
1) Use RHEL 4.7 Gold x86-64
2) Netdump (0.7.16-14); included with RHEL 4.7 Gold
3) Server system w/ a 5708 LoM (LAN on Motherboard)
   ie: HP DL 385G5
       Dell PowerEdge 1900

Steps to Reproduce:
1. Configure netdump
  a) vi /etc/sysconfig/netdump
  b) Add: 
       DEV=eth0
       NETDUMPADDR=172.16.x.x
  c) save and exit

2. Start netdump daemon:
  a) service netdump start
  b) accept unverified key (yes)
  c) enter password for netdump user

3. Crash the system under test (SUT)
  a) echo c > /proc/sysrq-trigger
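
The configuration in step 1 can also be scripted. The following is a minimal sketch, not part of the original report: the helper name write_netdump_config is hypothetical, and it only assumes that /etc/sysconfig/netdump holds simple KEY=value lines as shown above. Steps 2 and 3 are left as comments because they are interactive and because step 3 crashes the machine.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): set DEV and NETDUMPADDR in a
# netdump config file, replacing any earlier values for those two keys.
write_netdump_config() {
    conf="$1"    # config file, e.g. /etc/sysconfig/netdump
    dev="$2"     # interface used for the dump, e.g. eth0
    addr="$3"    # address of the remote netdump server
    tmp="$(mktemp)" || return 1
    if [ -f "$conf" ]; then
        # Keep every line except old DEV= / NETDUMPADDR= settings.
        grep -v -e '^DEV=' -e '^NETDUMPADDR=' "$conf" > "$tmp" || true
    fi
    printf 'DEV=%s\nNETDUMPADDR=%s\n' "$dev" "$addr" >> "$tmp"
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Steps 1-3 from the report, commented out on purpose (step 3 panics the box):
#   write_netdump_config /etc/sysconfig/netdump eth0 172.16.x.x
#   service netdump start          # prompts for the host key and password
#   echo c > /proc/sysrq-trigger   # forces the crash
```

Pointing the helper at a scratch file first is a cheap way to check the result before touching the real configuration.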
 
Actual results:
SUT does not send memory dump to a remote server prior to rebooting.
<0>CR2: 0000000000000036
<3>netpoll_start_netdump: called recursively.  rebooting.

See "Full Trace" below

Expected results:
Netdump should dump memory of crashed system to a remote server over a LAN


Full Trace:

SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff8023c40c>{sysrq_handle_crash+0}
PML4 117110067 PGD 116e82067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: mptctl mptbase sg ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) md5 ipv6 parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core hp_ilo(U) sunrpc 8021q ds yenta_socket pcmcia_core joydev button battery ac ehci_hcd uhci_hcd hw_random bnx2(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata cciss(U) sd_mod scsi_mod
Pid: 27256, comm: bash Not tainted 2.6.9-67.ELsmp
RIP: 0010:[<ffffffff8023c40c>] <ffffffff8023c40c>{sysrq_handle_crash+0}
RSP: 0018:000001011899dec0  EFLAGS: 00010216
RAX: 000000000000001f RBX: ffffffff80406f40 RCX: ffffffff803e95e8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063
RBP: 0000000000000063 R08: ffffffff803e95e8 R09: ffffffff80406f40
R10: 0000000100000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000000
FS:  0000002a955773e0(0000) GS:ffffffff804f2d80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process bash (pid: 27256, threadinfo 000001011899c000, task 0000010122249030)
Stack: ffffffff8023c5c2 000001011899c000 0000000000000002 000001011899df50
       0000000000000002 0000002a984d3000 ffffffff801b2741 0000000000000048
       0000000000000000 000001011bd13ec0
Call Trace:<ffffffff8023c5c2>{__handle_sysrq+102} <ffffffff801b2741>{write_sysrq_trigger+43}
       <ffffffff8017af0e>{vfs_write+207} <ffffffff8017aff6>{sys_write+69}
       <ffffffff8011026a>{system_call+126}

Code: c6 04 25 00 00 00 00 00 c3 e9 1c ff f3 ff e9 55 4d f4 ff 48
RIP <ffffffff8023c40c>{sysrq_handle_crash+0} RSP <000001011899dec0>
CR2: 0000000000000000
CPU#0 is executing netdump.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
< netdump activated - performing handshake with the server. >
<1>Unable to handle kernel NULL pointer dereference at 0000000000000036 RIP:
<1><ffffffffa00c4cea>{:bnx2:bnx2_interrupt+9}
PML4 117110067 PGD 116e82067 PMD 0
<0>Oops: 0000 [2] SMP
CPU 0
Modules linked in: mptctl mptbase sg ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) md5 ipv6 parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core hp_ilo(U) sunrpc 8021q ds yenta_socket pcmcia_core joydev button battery ac ehci_hcd uhci_hcd hw_random bnx2(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata cciss(U) sd_mod scsi_mod
Pid: 27256, comm: bash Not tainted 2.6.9-67.ELsmp
RIP: 0010:[<ffffffffa00c4cea>] <ffffffffa00c4cea>{:bnx2:bnx2_interrupt+9}
RSP: 0018:0000010129fd4890  EFLAGS: 00010046
RAX: 000001012b852240 RBX: 0000010129914380 RCX: 0000000000000006
RDX: 0000000030687465 RSI: 0000010129914000 RDI: 000000000000004a
RBP: 0000010129914000 R08: 0000000000000000 R09: 0000010129914000
R10: 0000010129914380 R11: 0000000000000000 R12: 0000010129914000
R13: 000001011899de18 R14: 000001011f24c4c0 R15: 000001011899de18
FS:  0000002a955773e0(0000) GS:ffffffff804f2d80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000036 CR3: 0000000000101000 CR4: 00000000000006e0
Process bash (pid: 27256, threadinfo 000001011899c000, task 0000010122249030)
Stack: 0000010129914000 ffffffffa00cb1c5 ffffffffa00cb194 0000000000000000
       000001011899de18 ffffffff802c0119 0000010129914000 0000010117010e40
       000001012992fd80 ffffffff802c008a
Call Trace:<ffffffffa00cb1c5>{:bnx2:poll_bnx2+49} <ffffffffa00cb194>{:bnx2:poll_bnx2+0}
       <ffffffff802c0119>{netpoll_poll_dev+92} <ffffffff802c008a>{netpoll_send_skb+340}
       <ffffffffa018d4f2>{:netdump:netpoll_netdump+308} <ffffffff8023c40c>{sysrq_handle_crash+0}
       <ffffffffa018d39a>{:netdump:netpoll_start_netdump+221}

Code: 41 0f b7 40 36 48 8b 7a 08 0f b7 c0 3b 46 20 75 10 48 8b 02
<1>RIP <ffffffffa00c4cea>{:bnx2:bnx2_interrupt+9} RSP <0000010129fd4890>
<0>CR2: 0000000000000036
<3>netpoll_start_netdump: called recursively.  rebooting.

Comment 1 Neil Horman 2008-11-10 16:29:52 UTC
There have been several bnx2 patches since 4.7 GOLD was released. Could you please try with the latest 4.8 beta and confirm that the problem still exists? Thanks!

Comment 2 Joe T 2008-11-10 22:27:04 UTC
Issue no longer seen with latest test kernel: kernel-2.6.9-78.15.EL.gtest.49.x86_64. Closing BZ.

BTW, Neil, thanks for the pointer.

Comment 3 Joe T 2008-11-11 00:04:37 UTC
Disregard the above comment from me. We are still seeing the same behavior w/ the latest bnx2 driver (1.8.1f).

The earlier test used the boxed bnx2 (v1.6.9) included in kernel-2.6.9-78.15.EL.gtest.49.x86_64, which works.

Comment 4 Neil Horman 2008-11-11 12:37:16 UTC
I'm sorry, you need to clarify what you are doing here:

In comment #2 and comment #3 you indicated that kernel-2.6.9-78.15.EL.gtest.49.x86_64 works properly, which uses bnx2 version 1.6.9

The latest RHEL5 kernel (newer even than the beta) uses driver version 1.7.9

Yet you say you see the problem on bnx2 version 1.8.1f. What exactly are you testing with?

Comment 6 Neil Horman 2008-11-11 13:38:21 UTC
As a clarification, I mentioned that the RHEL5 kernel above used driver version 1.7.9 to show where RHEL is in relation to your testing. RHEL4 (which this bug is against) uses version 1.6.9
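
Since the driver version in use is the point of confusion here, one way to pin it down is to record the running kernel and the driver version together. This is a sketch only: driver_version is a hypothetical helper name, and it assumes the usual "version: ..." line in the output of ethtool -i.

```shell
#!/bin/sh
# Hypothetical helper: print the "version:" field from `ethtool -i <iface>`
# output supplied on stdin.
driver_version() {
    awk -F': ' '$1 == "version" { print $2 }'
}

# Typical use on the system under test (interface name as in the report):
#   uname -r
#   ethtool -i eth0 | driver_version
```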

Comment 7 Joe T 2008-11-11 19:44:22 UTC
Issue(s) have been isolated and fixes incorporated into bnx2 v1.8.1g-test. Netdump functionality has been restored w/ bnx2 v1.8.1g-test4.

Neil:
Sorry for the confusion.

NH> Yet you say you see the problem on bnx2 version 1.8.1f. What exactly are you testing with?

We saw the issue w/ bnx2 v1.8.1f on RHEL 4.7 Gold x86-64. With bnx2 v1.8.1g-test the issue is no longer seen. Netdump now functions w/ bnx2 v1.8.1g-test under RHEL 4.7 Gold.

Comment 8 Michael Chan 2008-11-11 20:03:17 UTC
Neil, the patch you posted on netdev this morning fixed the issue for us.  Thanks a lot.

Comment 9 Neil Horman 2008-11-11 20:44:58 UTC
Ahh, ok, that clarifies a good deal. So this isn't actually a 4.7 bug (since the problem I fixed upstream doesn't exist in RHEL4). Rather, you were testing your latest driver dropped into the RHEL4 source tree and noting this problem. So what you actually need is for my patch to be backported to RHEL5. Is that right?

Comment 10 Michael Chan 2008-11-11 21:11:50 UTC
That's right.  Although I'm not sure if RHEL5 will have the same problem or not.  It depends on whether Andy has back-ported all that IRQ rework to support MSI-X and multiple RX/TX rings.

Comment 11 Neil Horman 2008-11-12 11:40:29 UTC
It needs a variant of this patch. The RHEL5 bnx2 driver uses dummy net_device structs to simulate multiqueue rx, but poll_napi only checks the polled device for rx, so we repair that.

Comment 13 Don Zickus 2008-12-02 22:19:43 UTC
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 15 Joe T 2008-12-09 03:25:07 UTC
So far we haven't encountered any issues; however, we need to rerun tests with jumbo frames enabled.

Comment 17 Chris Ward 2008-12-17 12:21:54 UTC
Broadcom, test status?

Comment 18 Joe T 2008-12-17 19:21:02 UTC
We found an issue while under Xen. A kernel panic will occur if the following conditions are met:

1) Jumbo frames are enabled on both ports of a 5709C
2) Both interfaces are brought up.
3) Driver is unloaded from memory (rmmod)
   Kernel panic will occur

Comment 20 Michael Chan 2008-12-17 19:36:31 UTC
Joe, is this related to netdump or a separate issue?

Comment 21 Neil Horman 2008-12-17 20:09:24 UTC
The findings in comment #18 look like a separate issue to me, and need their own bug. I believe mpg is opening it now.

Comment 22 Joe T 2008-12-17 20:21:17 UTC
MChan: This is unrelated as Netdump isn't enabled when the kernel panic is observed. Please see CQ38955 (internal Broadcom issue tracking DB).

Will open a separate issue in BZ, unless "mpg" has already opened one. This issue with Xen (unrelated to Netdump) was also seen on SLES 10 SP2.

Comment 23 Mike Gahagan 2008-12-17 20:22:31 UTC
I have opened bz 476897 to address the panic noted in comment #18.

Comment 25 Chris Ward 2008-12-18 05:32:13 UTC
Broadcom, please confirm that this bug fix has been Verified, but that a new issue has been encountered that will be addressed in a future release (bug 476897)? Thanks for clarifying for me.

Comment 29 errata-xmlrpc 2009-01-20 20:02:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html