Bug 527656 - bnx2x fails when iptables is on
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Assignee: Stanislaw Gruszka
QA Contact: Network QE
Blocks: 537013
Reported: 2009-10-07 07:22 UTC by Veaceslav Falico
Modified: 2018-11-14 20:29 UTC

Doc Type: Bug Fix
Last Closed: 2011-02-16 15:46:38 UTC


Attachments
log provided by HP (8.56 KB, application/octet-stream)
2009-10-13 00:41 UTC, Takuma Umeya
bnx2x: Changing the Disabled state to a flag (3.41 KB, application/x-gzip)
2009-10-14 19:39 UTC, Eilon Greenstein
netxtreme2-5.0.17-dcc-fixes.patch (14.56 KB, patch)
2009-10-15 09:59 UTC, Stanislaw Gruszka
message provided by HP (182.71 KB, text/plain)
2009-10-23 02:20 UTC, Takuma Umeya
logs.debug.bad.gz (51.66 KB, application/x-gzip)
2009-10-27 15:59 UTC, Stanislaw Gruszka
logs.debug.good.gz (32.37 KB, application/x-gzip)
2009-10-27 16:01 UTC, Stanislaw Gruszka
CSUM is always set when GSO is set (942 bytes, patch)
2009-10-28 16:50 UTC, Eilon Greenstein
Set CSUM on GSO (769 bytes, patch)
2009-10-28 17:44 UTC, Eilon Greenstein
set_csum_on_gso for RHEL4 (485 bytes, patch)
2009-10-29 06:51 UTC, Stanislaw Gruszka


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0263 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update 2011-02-16 15:14:55 UTC

Description Veaceslav Falico 2009-10-07 07:22:51 UTC
Description of problem:
When iptables is enabled, network communication with
bnx2x fails.

The following messages are logged.

Oct  2 15:56:06 localhost kernel: [bnx2x_stats_update:4549(eth0)]storm stats were not updated for 3 times
Oct  2 15:56:06 localhost kernel: [bnx2x_stats_update:4550(eth0)]driver assert
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:632(eth0)]begin crash dump -----------------
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:640(eth0)]def_c_idx(3240)  def_u_idx(0)  def_x_idx(0)  def_t_idx(0)  def_att_idx(6)  attn_state(0)  spq_prod_idx(171)
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:651(eth0)]fp0: rx_bd_prod(9154)  rx_bd_cons(156)  *rx_bd_cons_sb(0)  rx_comp_prod(92dd)  rx_comp_cons(82dd) *rx_cons_sb(82dd)
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:656(eth0)]     rx_sge_prod(400)  last_max_sge(0) fp_u_idx(15fa) *sb_u_idx(15fa)
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:666(eth0)]fp1: tx_pkt_prod(a717)  tx_pkt_cons(a707) tx_bd_prod(fae1)  tx_bd_cons(faab)  *tx_cons_sb(a707)
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:670(eth0)]     fp_c_idx(9718)  *sb_c_idx(9718)  tx_db_prod(fae1)
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:685(eth0)]fp0: rx_bd[2d3]=[0:34fe7010]  sw_bd=[f46a5c00]
Oct  2 15:56:06 localhost kernel: [bnx2x_panic_dump:685(eth0)]fp0: rx_bd[2d4]=[0:347bc010]  sw_bd=[f5cf2800]


Version-Release number of selected component (if applicable):


How reproducible:
This problem is reproducible on BL460c G6.

1) Create /etc/sysconfig/iptables with the following contents:
# cat /etc/sysconfig/iptables
# Generated by iptables-save v1.2.11 on Fri Oct  2 15:54:04 2009
*nat
:PREROUTING ACCEPT [14:2015]
:POSTROUTING ACCEPT [3:628]
:OUTPUT ACCEPT [3:628]
-A POSTROUTING -s 192.168.67.0/255.255.255.0 -p tcp -m tcp --sport 20 -j MASQUERADE
COMMIT
# Completed on Fri Oct  2 15:54:04 2009

2) service iptables start

3) ftp xxxxx
ftp > put <file>  ; file bigger than 1MB

problem occurs

Once the problem occurs, network communication through the bnx2x NIC fails.
To recover, run the following command:
# rmmod bnx2x ;  service iptables off; service network restart



  
Actual results:
Network stalls, bnx2x driver asserts.

Expected results:
Normal workflow.


Additional info:
As per adjacent BZs, I suggested switching off TSO; will see if that helps.

Comment 1 Stanislaw Gruszka 2009-10-07 12:43:27 UTC
> This problem is reproducible on BL460c G6.

How can I get access to this machine?

Comment 2 Stanislaw Gruszka 2009-10-07 13:14:23 UTC
> The following messages are logged.
> 
> Oct  2 15:56:06 localhost kernel: [bnx2x_stats_update:4549(eth0)]storm stats
> were not updated for 3 times

There is no such message in the RHEL4 driver; it exists in the RHEL5 driver, however.

> Version-Release number of selected component (if applicable):

This information is missing. It looks like a newer Broadcom module is being used instead of the one shipped with RHEL4, correct? Otherwise, is this RHEL5 rather than RHEL4?

Comment 3 Stanislaw Gruszka 2009-10-07 13:21:07 UTC
Taken from the sosreport:

Broadcom NetXtreme II 5771x 10Gigabit Ethernet Driver bnx2x 1.50.13 ($DateTime: 2009/07/22 07:22:59 $)

Hence bnx2x is an external module. I guess Broadcom should support this; that would be the best way, as this is a very low-level bug.

Comment 7 Issue Tracker 2009-10-08 10:01:13 UTC
Event posted on 10-08-2009 07:01pm JST by tumeya

Apologies for not having checked this in-house earlier.
I've confirmed a simpler reproducer, ruling out a few things from
their original report.


1. run modprobe:
   # modprobe iptable_nat
   # lsmod | grep ip
   iptable_nat            27613  0
   ip_conntrack           46085  1 iptable_nat
   ip_tables              23105  1 iptable_nat
   ipv6                  244833  26 #<---not relevant but anyways...

2. Confirm the rule is blank:
   # iptables -t nat -L
   Chain PREROUTING (policy ACCEPT)
   target     prot opt source               destination         

   Chain POSTROUTING (policy ACCEPT)
   target     prot opt source               destination         

   Chain OUTPUT (policy ACCEPT)
   target     prot opt source               destination         

3. ftp to any box; put 1MB+ file; network hangs. 


I've pinged HP and they said this would cause the same issue they are
seeing. 


This event sent from IssueTracker by tumeya 
 issue 350828

Comment 8 Stanislaw Gruszka 2009-10-08 10:19:06 UTC
Eilon, 

We have a problem with bnx2x on RHEL4. The issue is also reproducible with the (newest?) driver version 1.50.13. Any help would be much appreciated. If you need more info, please let us know.

Comment 9 Eilon Greenstein 2009-10-08 10:58:33 UTC
I will try to build a similar setup locally and reproduce. Stay tuned.

Comment 10 Eilon Greenstein 2009-10-08 16:27:44 UTC
I could not reproduce on a different system. I’m trying to get BL460 to test again.

Comment 11 Stanislaw Gruszka 2009-10-08 17:02:58 UTC
(In reply to comment #10)
> I could not reproduce on a different system. I’m trying to get BL460 to test
> again.  

Thank you for your effort.

Here are some more details about configuration:

BL460c G6
RHEL4.8 (2.6.9-89.ELsmp) 32bit (i386)
BCM57711E 100/1GB/10GB NIC

We do not have any information about the connected switch. Do you think this is important?

Comment 12 Eilon Greenstein 2009-10-08 19:06:33 UTC
I think that there might be an earlier error prior to the statistics collection failure.

Can you please enable the following debug prints 0xef00f7 and send the log?

Thanks,
Eilon
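
The requested mask can be read as a bit field. The following is a minimal sketch, assuming the driver's debug value is used as the standard netdev message level in its low 16 bits. The bit values are the generic NETIF_MSG_* ones from <linux/netdevice.h>; the upper bits are driver-specific and are returned undecoded. The helper name decode_msglevel is hypothetical and is not part of the driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: list the generic NETIF_MSG_* names set in the low
 * 16 bits of `mask` into `buf`, and return the remaining upper bits, which
 * are driver-specific and not decoded here.
 */
static unsigned int decode_msglevel(unsigned int mask, char *buf, size_t len)
{
    static const struct { unsigned int bit; const char *name; } bits[] = {
        { 0x0001, "DRV" },       { 0x0002, "PROBE" },  { 0x0004, "LINK" },
        { 0x0008, "TIMER" },     { 0x0010, "IFDOWN" }, { 0x0020, "IFUP" },
        { 0x0040, "RX_ERR" },    { 0x0080, "TX_ERR" }, { 0x0100, "TX_QUEUED" },
        { 0x0200, "INTR" },      { 0x0400, "TX_DONE" },{ 0x0800, "RX_STATUS" },
        { 0x1000, "PKTDATA" },   { 0x2000, "HW" },     { 0x4000, "WOL" },
    };
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        if (mask & bits[i].bit) {
            if (buf[0] != '\0')
                strncat(buf, " ", len - strlen(buf) - 1);
            strncat(buf, bits[i].name, len - strlen(buf) - 1);
        }
    }

    /* Everything above bit 15 is left for the driver's own definitions. */
    return mask & ~0xffffu;
}
```

Calling decode_msglevel(0xef00f7, buf, sizeof(buf)) shows that the generic part of the mask enables every class from DRV to TX_ERR except TIMER; the remaining 0xef0000 selects driver-specific message classes.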

Comment 16 Takuma Umeya 2009-10-13 00:41:33 UTC
Created attachment 364536 [details]
log provided by HP

Comment 17 Stanislaw Gruszka 2009-10-14 11:37:42 UTC
This log does not seem to be sufficient. I think a log from the 1.50.13 driver is needed (this one seems to be from the RHEL4 stock driver). The log should start when the bnx2x module is loaded and end after bnx2x_panic_dump or, even better, after rmmod.

Eilon, am I right?

Comment 18 Eilon Greenstein 2009-10-14 19:39:26 UTC
Created attachment 364796 [details]
bnx2x: Changing the Disabled state to a flag

Hi,

Indeed, the log is very limited and it is missing the prints I was hoping to see, especially the FW dump. Though I’m still unable to reproduce this issue, I was able to reproduce and fix (at least I hope so ;)) a race condition when loading and unloading the driver on an HP system with DCC (device control channel) enabled.

I posted a full fix to the DCC issues, which includes fixes to 2 other issues, on netdev. The problem is only seen with DCC which is part of version 1.50.13 but not part of the current RH4.8 version.

I think that this issue is addressed by one of the fixes in patches 4, 5 or 6, but I don’t have enough information to know for sure. If one of those race conditions occurs, statistics collection will eventually stop. If possible, please try to run with those patches. I’m attaching them here as well.

Thanks,
Eilon

Comment 20 Stanislaw Gruszka 2009-10-15 09:59:11 UTC
Created attachment 364886 [details]
netxtreme2-5.0.17-dcc-fixes.patch

Eilon, your patches do not apply, and even if I force them to apply, compilation fails. I fixed that. Please check whether this patch contains all the intended fixes. The patch is for the driver from the Broadcom site.

Comment 22 Eilon Greenstein 2009-10-15 10:58:28 UTC
Sorry about that, it was based on net-next  - this patch looks good.

Comment 27 Takuma Umeya 2009-10-23 02:20:41 UTC
Created attachment 365807 [details]
message provided by HP

Comment 31 Stanislaw Gruszka 2009-10-27 15:55:02 UTC
Current status:

DCC fixes do not help with this bug.

I got access to a BL460c G6 at Red Hat and I'm able to reproduce the bug. Loading the iptable_nat module (without any iptables rules) and transmitting data is enough to reproduce it.

Comment 32 Stanislaw Gruszka 2009-10-27 15:59:02 UTC
Created attachment 366296 [details]
logs.debug.bad.gz

Logs when iptable_nat is loaded, with bnx2x debug=0xffffff and stack dumps from iptables code from functions ip_nat_fn(), ip_nat_out().

Comment 33 Stanislaw Gruszka 2009-10-27 16:01:29 UTC
Created attachment 366297 [details]
logs.debug.good.gz

Logs when transmitting data without iptable_nat module, with bnx2x debug=0xffffff

Comment 34 Stanislaw Gruszka 2009-10-27 16:07:51 UTC
With iptable_nat loaded, TSO is used (xmit_type 8), and this seems to make the firmware crash: there are no TSO transmissions in the "good" logs (only xmit_type 0, 1 and 5).

Eilon, any hints?
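
The xmit_type values quoted above can be read as a bit mask. The following is a minimal userspace sketch, assuming flag values modeled on the driver's XMIT_* definitions (the values are an assumption and are not taken from the attached logs); the helper names xmit_type_str and gso_without_csum are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed flag values, modeled on the driver's XMIT_* definitions. */
#define XMIT_PLAIN    0x00
#define XMIT_CSUM_V4  0x01
#define XMIT_CSUM_V6  0x02
#define XMIT_CSUM_TCP 0x04
#define XMIT_GSO_V4   0x08
#define XMIT_GSO_V6   0x10

/* Hypothetical helper: write a readable decoding of xmit_type into buf. */
static const char *xmit_type_str(unsigned int t, char *buf, size_t len)
{
    static const struct { unsigned int bit; const char *name; } flags[] = {
        { XMIT_CSUM_V4, "CSUM_V4" }, { XMIT_CSUM_V6, "CSUM_V6" },
        { XMIT_CSUM_TCP, "CSUM_TCP" }, { XMIT_GSO_V4, "GSO_V4" },
        { XMIT_GSO_V6, "GSO_V6" },
    };
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    if (t == XMIT_PLAIN) {
        snprintf(buf, len, "PLAIN");
        return buf;
    }
    for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
        if (t & flags[i].bit) {
            if (buf[0] != '\0')
                strncat(buf, "|", len - strlen(buf) - 1);
            strncat(buf, flags[i].name, len - strlen(buf) - 1);
        }
    }
    return buf;
}

/* Non-zero when a GSO bit is set without any checksum bit. */
static int gso_without_csum(unsigned int t)
{
    return (t & (XMIT_GSO_V4 | XMIT_GSO_V6)) &&
           !(t & (XMIT_CSUM_V4 | XMIT_CSUM_V6 | XMIT_CSUM_TCP));
}
```

Under these assumed values, 0, 1 and 5 decode to PLAIN, CSUM_V4 and CSUM_V4|CSUM_TCP, while 8 decodes to GSO_V4 alone, that is, segmentation offload requested with no checksum bits set.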

Comment 35 Eilon Greenstein 2009-10-28 15:01:41 UTC
I’m able to reproduce now and I’m looking into it. Stay tuned…

Comment 36 Eilon Greenstein 2009-10-28 16:50:41 UTC
Created attachment 366466 [details]
CSUM is always set when GSO is set

Apparently, when iptable_nat is loaded, we receive a request for GSO without an explicit CSUM offload request. This patch takes care of the CSUM when GSO is set. It fixes the problem on my setup; please let me know the status on yours.
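
The idea described above (a GSO request implies checksum offload) can be sketched as follows. This is a userspace model and not the attached patch: struct pkt_info is a hypothetical stand-in for the few sk_buff fields involved, and the XMIT_* values are assumed, modeled on the driver's flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed flag values, modeled on the driver's XMIT_* definitions. */
#define XMIT_PLAIN    0x00
#define XMIT_CSUM_V4  0x01
#define XMIT_CSUM_V6  0x02
#define XMIT_CSUM_TCP 0x04
#define XMIT_GSO_V4   0x08
#define XMIT_GSO_V6   0x10

/* Hypothetical stand-in for the sk_buff fields that matter here. */
struct pkt_info {
    bool csum_requested; /* stack explicitly asked for checksum offload */
    bool is_ipv6;
    bool is_tcp;
    bool gso_tcpv4;      /* TCP/IPv4 segmentation offload requested */
    bool gso_tcpv6;      /* TCP/IPv6 segmentation offload requested */
};

static unsigned int xmit_type(const struct pkt_info *p)
{
    unsigned int rc = XMIT_PLAIN;

    assert(p != NULL);

    /* Explicit checksum offload request from the stack. */
    if (p->csum_requested) {
        rc = p->is_ipv6 ? XMIT_CSUM_V6 : XMIT_CSUM_V4;
        if (p->is_tcp)
            rc |= XMIT_CSUM_TCP;
    }

    /* GSO implies checksum offload, even without an explicit request. */
    if (p->gso_tcpv4)
        rc |= XMIT_GSO_V4 | XMIT_CSUM_V4 | XMIT_CSUM_TCP;
    else if (p->gso_tcpv6)
        rc |= XMIT_GSO_V6 | XMIT_CSUM_V6 | XMIT_CSUM_TCP;

    return rc;
}
```

With this rule, a TCP/IPv4 packet that arrives with GSO set but without a checksum request yields 0x0d (GSO_V4|CSUM_V4|CSUM_TCP) in this model rather than a bare 8.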

Comment 37 Eilon Greenstein 2009-10-28 17:22:23 UTC
The patch works, but it is not good enough. I will work on something cleaner and send it out.

Comment 38 Eilon Greenstein 2009-10-28 17:44:38 UTC
Created attachment 366473 [details]
Set CSUM on GSO

This fix is more like it. Sorry for the mess.

Comment 39 Issue Tracker 2009-10-29 06:04:49 UTC
Event posted on 10-29-2009 03:04pm JST by tumeya

It's confirmed that the patch works. The issue is gone. 


This event sent from IssueTracker by tumeya 
 issue 350828

Comment 41 Stanislaw Gruszka 2009-10-29 06:48:53 UTC
I also confirm that the fix works on my setup. Many, many thanks, Eilon.

Comment 42 Stanislaw Gruszka 2009-10-29 06:51:20 UTC
Created attachment 366563 [details]
set_csum_on_gso for RHEL4

Patch for RHEL4. I'm going to prepare kernel packages now.

Comment 43 Stanislaw Gruszka 2009-10-29 08:41:33 UTC
Brew build:
https://brewweb.devel.redhat.com/taskinfo?taskID=2052216

Test packages (i686, x86_64, src.rpm) for public download:
http://people.redhat.com/sgruszka/bnx2x-rhel4/

Comment 47 Stanislaw Gruszka 2009-11-03 08:22:20 UTC
Eilon, is the patch queued for upstream submission? I need that info before posting to RKML.

Comment 48 Eilon Greenstein 2009-11-03 09:38:24 UTC
Hi,

I’m sorry about the delay, but I’m under the gun for another project… Yes – I plan to submit this patch to Dave Miller’s net-next soon (within days).

BTW - Any idea why iptable_nat is not setting CHECKSUM_HW in skb->ip_summed? The reason will not affect this patch, since bnx2x should always configure the FW/HW for checksum offload when setting GSO, but it is still interesting to know.

Thanks,
Eilon

Comment 49 Stanislaw Gruszka 2009-11-03 13:56:34 UTC
(In reply to comment #48)
> BTW - Any idea why iptable_nat is not setting the CHECKSUM_HW in the
> skb->ip_summed? 

I guess it is only to have the same code for incoming, outgoing and forwarded packets and to avoid complications with the pseudo-header. I do not see any other reason, but I don't know the netfilter code very well. Anyway, upstream checksumming in netfilter looks better.

Comment 52 RHEL Program Management 2009-11-05 18:11:28 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 57 Vivek Goyal 2009-11-09 20:16:22 UTC
Committed in 89.15.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 66 errata-xmlrpc 2011-02-16 15:46:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

