Description of problem:
Netdump does not function w/ bnx2 >= v1.8h (Broadcom NetXtreme II network card). The system under test (SUT), where Netdump is running, does not send a memory dump to the remote netdump server before rebooting. Our bnx2 developers have looked into the issue. Their conclusion at this time is that the issue is a Netdump bug or limitation.

Version-Release number of selected component (if applicable):
Netdump 0.7.16-14 (RHEL 4.7 Gold)

How reproducible:
1) Use RHEL 4.7 Gold x86-64
2) Netdump (0.7.16-14), included with RHEL 4.7 Gold
3) Server system w/ a 5708 LoM (LAN on Motherboard), e.g.:
   HP DL 385G5
   Dell PowerEdge 1900

Steps to Reproduce:
1. Configure netdump
   a) vi /etc/sysconfig/netdump
   b) Add:
      DEV=eth0
      NETDUMPADDR=172.16.x.x
   c) Save and exit
2. Start the netdump daemon:
   a) service netdump start
   b) Accept the unverified key (yes)
   c) Enter the password for the netdump user
3. Crash the system under test (SUT)
   a) echo c > /proc/sysrq-trigger

Actual results:
SUT does not send a memory dump to the remote server before rebooting.
<0>CR2: 0000000000000036
<3>netpoll_start_netdump: called recursively. rebooting.
See "Full Trace" below.

Expected results:
Netdump should dump the memory of the crashed system to a remote server over the LAN.

Full Trace:
SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff8023c40c>{sysrq_handle_crash+0}
PML4 117110067 PGD 116e82067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: mptctl mptbase sg ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) md5 ipv6 parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core hp_ilo(U) sunrpc 8021q ds yenta_socket pcmcia_core joydev button battery ac ehci_hcd uhci_hcd hw_random bnx2(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata cciss(U) sd_mod scsi_mod
Pid: 27256, comm: bash Not tainted 2.6.9-67.ELsmp
RIP: 0010:[<ffffffff8023c40c>] <ffffffff8023c40c>{sysrq_handle_crash+0}
RSP: 0018:000001011899dec0 EFLAGS: 00010216
RAX: 000000000000001f RBX: ffffffff80406f40 RCX: ffffffff803e95e8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063
RBP: 0000000000000063 R08: ffffffff803e95e8 R09: ffffffff80406f40
R10: 0000000100000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000000
FS: 0000002a955773e0(0000) GS:ffffffff804f2d80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process bash (pid: 27256, threadinfo 000001011899c000, task 0000010122249030)
Stack: ffffffff8023c5c2 000001011899c000 0000000000000002 000001011899df50
       0000000000000002 0000002a984d3000 ffffffff801b2741 0000000000000048
       0000000000000000 000001011bd13ec0
Call Trace:<ffffffff8023c5c2>{__handle_sysrq+102} <ffffffff801b2741>{write_sysrq_trigger+43}
       <ffffffff8017af0e>{vfs_write+207} <ffffffff8017aff6>{sys_write+69}
       <ffffffff8011026a>{system_call+126}
Code: c6 04 25 00 00 00 00 00 c3 e9 1c ff f3 ff e9 55 4d f4 ff 48
RIP <ffffffff8023c40c>{sysrq_handle_crash+0} RSP <000001011899dec0>
CR2: 0000000000000000
CPU#0 is executing netdump.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
< netdump activated - performing handshake with the server. >
<1>Unable to handle kernel NULL pointer dereference at 0000000000000036 RIP:
<1><ffffffffa00c4cea>{:bnx2:bnx2_interrupt+9}
PML4 117110067 PGD 116e82067 PMD 0
<0>Oops: 0000 [2] SMP
CPU 0
Modules linked in: mptctl mptbase sg ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) md5 ipv6 parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core hp_ilo(U) sunrpc 8021q ds yenta_socket pcmcia_core joydev button battery ac ehci_hcd uhci_hcd hw_random bnx2(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata cciss(U) sd_mod scsi_mod
Pid: 27256, comm: bash Not tainted 2.6.9-67.ELsmp
RIP: 0010:[<ffffffffa00c4cea>] <ffffffffa00c4cea>{:bnx2:bnx2_interrupt+9}
RSP: 0018:0000010129fd4890 EFLAGS: 00010046
RAX: 000001012b852240 RBX: 0000010129914380 RCX: 0000000000000006
RDX: 0000000030687465 RSI: 0000010129914000 RDI: 000000000000004a
RBP: 0000010129914000 R08: 0000000000000000 R09: 0000010129914000
R10: 0000010129914380 R11: 0000000000000000 R12: 0000010129914000
R13: 000001011899de18 R14: 000001011f24c4c0 R15: 000001011899de18
FS: 0000002a955773e0(0000) GS:ffffffff804f2d80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000036 CR3: 0000000000101000 CR4: 00000000000006e0
Process bash (pid: 27256, threadinfo 000001011899c000, task 0000010122249030)
Stack: 0000010129914000 ffffffffa00cb1c5 ffffffffa00cb194 0000000000000000
       000001011899de18 ffffffff802c0119 0000010129914000 0000010117010e40
       000001012992fd80 ffffffff802c008a
Call Trace:<ffffffffa00cb1c5>{:bnx2:poll_bnx2+49} <ffffffffa00cb194>{:bnx2:poll_bnx2+0}
       <ffffffff802c0119>{netpoll_poll_dev+92} <ffffffff802c008a>{netpoll_send_skb+340}
       <ffffffffa018d4f2>{:netdump:netpoll_netdump+308} <ffffffff8023c40c>{sysrq_handle_crash+0}
       <ffffffffa018d39a>{:netdump:netpoll_start_netdump+221}
Code: 41 0f b7 40 36 48 8b 7a 08 0f b7 c0 3b 46 20 75 10 48 8b 02
<1>RIP <ffffffffa00c4cea>{:bnx2:bnx2_interrupt+9} RSP <0000010129fd4890>
<0>CR2: 0000000000000036
<3>netpoll_start_netdump: called recursively. rebooting.
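For readers who want to follow the path into the second oops, the frames of its call trace can be pulled out mechanically. The following is only a rough sketch: the regular expression is an assumption fitted to the frame format seen in this report (`<address>{:module:symbol+offset}`), and the helper name `parse_frames` is made up for illustration. It is not part of netdump or any kernel tooling.

```python
import re

# Assumed frame format, taken from this report only, e.g.
#   <ffffffffa00cb1c5>{:bnx2:poll_bnx2+49}   (symbol inside a module)
#   <ffffffff802c0119>{netpoll_poll_dev+92}  (symbol in the core kernel)
FRAME_RE = re.compile(
    r"<(?P<addr>[0-9a-f]{16})>"
    r"\{(?::(?P<module>\w+):)?(?P<symbol>\w+)\+(?P<offset>\d+)\}"
)


def parse_frames(trace_text):
    """Return (module, symbol, offset) tuples in the order they are printed.

    module is None when the frame carries no :module: prefix.
    """
    return [
        (m.group("module"), m.group("symbol"), int(m.group("offset")))
        for m in FRAME_RE.finditer(trace_text)
    ]


# Call trace of the second oops, copied from the report above.
SECOND_OOPS_CALL_TRACE = (
    "Call Trace:<ffffffffa00cb1c5>{:bnx2:poll_bnx2+49} "
    "<ffffffffa00cb194>{:bnx2:poll_bnx2+0} "
    "<ffffffff802c0119>{netpoll_poll_dev+92} "
    "<ffffffff802c008a>{netpoll_send_skb+340} "
    "<ffffffffa018d4f2>{:netdump:netpoll_netdump+308} "
    "<ffffffff8023c40c>{sysrq_handle_crash+0} "
    "<ffffffffa018d39a>{:netdump:netpoll_start_netdump+221}"
)

if __name__ == "__main__":
    for module, symbol, offset in parse_frames(SECOND_OOPS_CALL_TRACE):
        print(f"{module or 'kernel':8} {symbol}+{offset}")
```

Listing the frames this way shows netdump entering netpoll, netpoll calling the driver's poll routine, and the fault occurring in bnx2_interrupt, which is the RIP of the second oops.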
There have been several bnx2 patches since 4.7 GOLD was released. Could you please try with the latest 4.8 beta and confirm whether the problem still exists? Thanks!
Issue no longer seen with latest test kernel: kernel-2.6.9-78.15.EL.gtest.49.x86_64. Closing BZ. BTW, Neil, thanks for the pointer.
Disregard the above comment from me. We are still seeing the same behavior w/ the latest bnx2 driver (1.8.1f). The earlier test used the boxed bnx2 (v1.6.9) that is in kernel-2.6.9-78.15.EL.gtest.49.x86_64, which works.
I'm sorry, but you need to clarify what you are doing here. In comment #2 and comment #3 you indicated that kernel-2.6.9-78.15.EL.gtest.49.x86_64, which uses bnx2 version 1.6.9, works properly. The latest RHEL5 kernel (newer even than the beta) uses driver version 1.7.9. Yet you say you see the problem on bnx2 version 1.8.1f. What exactly are you testing with?
As a clarification, I mentioned that the RHEL5 kernel above uses driver version 1.7.9 to show where RHEL is in relation to your testing. RHEL4 (which this bug is against) uses version 1.6.9.
Issue(s) have been isolated and fixes incorporated into bnx2 v1.8.1g-test. Netdump functionality has been restored w/ bnx2 v1.8.1g-test4.
Neil: Sorry for the confusion.
NH> Yet you say you see the problem on bnx2 version 1.8.1f. What exactly are you testing with?
We saw the issue w/ bnx2 v1.8.1f on RHEL 4.7 Gold x86-64. With bnx2 v1.8.1g-test the issue is no longer seen. Netdump now functions w/ bnx2 v1.8.1f under RHEL 4.7 Gold.
Neil, the patch you posted on netdev this morning fixed the issue for us. Thanks a lot.
Ahh, ok, that clarifies a good deal. So this isn't actually a 4.7 bug (since the problem I fixed upstream doesn't exist in RHEL4). Rather, you were testing your latest driver dropped into the RHEL4 source tree and noticed this problem. So what you actually need is for my patch to be backported to RHEL5. Is that right?
That's right. Although I'm not sure if RHEL5 will have the same problem or not. It depends on whether Andy has back-ported all that IRQ rework to support MSI-X and multiple RX/TX rings.
It needs a variant of this patch. The RHEL5 bnx2 driver uses dummy net_device structs to simulate multiqueue rx, but poll_napi only checks the polled device for rx, so we repair that.
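The mismatch described above can be pictured with a small user-space model. This is only an illustrative sketch under stated assumptions: `FakeDev`, `FakeAdapter`, and both poll functions are hypothetical names invented here, and none of this is the kernel's poll_napi or the actual RHEL5 patch. It just shows why rx work scheduled on a dummy device is never serviced when only the polled device is checked.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeDev:
    """Hypothetical stand-in for a net_device; not the kernel structure."""
    name: str
    rx_scheduled: bool = False
    rx_pending: int = 0

    def poll(self) -> int:
        """Handle pending rx if this device is scheduled; return the count."""
        if not self.rx_scheduled:
            return 0
        handled, self.rx_pending = self.rx_pending, 0
        self.rx_scheduled = False
        return handled


@dataclass
class FakeAdapter:
    """Hypothetical adapter: one real device plus dummy per-queue devices."""
    real: FakeDev
    dummies: List[FakeDev] = field(default_factory=list)


def poll_polled_device_only(adapter: FakeAdapter) -> int:
    # Models the behavior described above: only the polled device is checked.
    return adapter.real.poll()


def poll_including_dummies(adapter: FakeAdapter) -> int:
    # Models the repaired behavior: the dummy devices are checked as well.
    return adapter.real.poll() + sum(d.poll() for d in adapter.dummies)


def make_adapter() -> FakeAdapter:
    # rx work is queued on a dummy device, not on the real one.
    return FakeAdapter(
        real=FakeDev("eth0"),
        dummies=[FakeDev("eth0-rx0", rx_scheduled=True, rx_pending=3)],
    )


if __name__ == "__main__":
    adapter = make_adapter()
    print("polled device only:", poll_polled_device_only(adapter))
    print("including dummies: ", poll_including_dummies(adapter))
```

In this model the first poll handles nothing because the real device has no rx scheduled, while the second drains the packets waiting on the dummy device.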
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
So far we haven't encountered any issues; however, we need to rerun tests with jumbo frames enabled.
Broadcom, test status?
We found an issue while running under Xen. A kernel panic occurs if the following conditions are met:
1) Jumbo frames are enabled on both ports of a 5709C.
2) Both interfaces are brought up.
3) The driver is unloaded from memory (rmmod).
Joe, is this related to netdump or a separate issue?
The findings in comment #18 look like a separate issue to me and need their own bug. I believe mpg is opening it now.
MChan: This is unrelated, as Netdump isn't enabled when the kernel panic is observed. Please see CQ38955 (internal Broadcom issue tracking DB). I will open a separate issue in BZ, unless "mpg" has already opened one. This issue with Xen (unrelated to Netdump) was also seen on SLES 10 SP2.
I have opened bz 476897 to address the panic noted in comment #18.
Broadcom, please confirm that this bug fix has been Verified, but that a new issue has been encountered that will be addressed in a future release (bug 476897)? Thanks for clarifying for me.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html