Bug 484667 - Dropping packets in bnx2 since 1.7.9 bnx2 version
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: All
OS: Linux
low
medium
Target Milestone: rc
: ---
Assignee: Andy Gospodarek
QA Contact: Martin Jenner
URL:
Whiteboard:
: 480693 488749 (view as bug list)
Depends On:
Blocks:
Reported: 2009-02-09 13:03 UTC by Marcus Alves Grando
Modified: 2018-11-14 13:02 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:35:16 UTC
Target Upstream Version:
Embargoed:


Attachments
bnx2-fixup-problems-with-netpoll-implementation.patch (4.17 KB, patch)
2009-02-11 22:49 UTC, Andy Gospodarek
bnx2-revert-multiqueue-support-and-fix-netdump.patch (52.88 KB, patch)
2009-02-13 22:13 UTC, Andy Gospodarek
bnx2-final-working-and-tested.patch (53.80 KB, patch)
2009-02-16 17:43 UTC, Andy Gospodarek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1024 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update 2009-05-18 14:57:26 UTC

Description Marcus Alves Grando 2009-02-09 13:03:41 UTC
Hello Andy,

Since the bnx2 update to 1.7.9, all my servers have been dropping packets randomly.

I've tested with and without TSO and rx-checksumming, and the behaviour is the same.

[12:49:52] root@mail-fe05(temora):~# ifconfig eth1 | grep pack
          RX packets:5795441 errors:0 dropped:1922 overruns:0 frame:0
          TX packets:5054322 errors:0 dropped:0 overruns:0 carrier:0
[12:50:48] root@mail-fe05(temora):~# ifconfig eth1 | grep pack
          RX packets:6162389 errors:0 dropped:2294 overruns:0 frame:0
          TX packets:5373380 errors:0 dropped:0 overruns:0 carrier:0

[12:57:13] root@mail-fe05(temora):~# uname -a
Linux temora.hst.terra.com.br 2.6.9-80.0.2.ELsmp #1 SMP Sun Feb 8 15:03:49 UTC 2009 i686 athlon i386 GNU/Linux

[12:59:35] root@mail-fe05(temora):~# ethtool -i eth1
driver: bnx2
version: 1.7.9-1
firmware-version: 1.9.6
bus-info: 0000:07:05.0
[12:59:39] root@mail-fe05(temora):~# ethtool -i eth0
driver: bnx2
version: 1.7.9-1
firmware-version: 1.9.6
bus-info: 0000:06:04.0

[12:59:46] root@mail-fe05(temora):~# lspci | grep -i eth
03:04.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706 Gigabit Ethernet (rev 02)
04:05.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706 Gigabit Ethernet (rev 02)
06:04.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706 Gigabit Ethernet (rev 02)
07:05.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706 Gigabit Ethernet (rev 02)
41:01.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706 Gigabit Ethernet (rev 02)
41:02.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706 Gigabit Ethernet (rev 02)

Extra patches:

[13:01:07] root@mail-fe05(temora):~# rpm -q --changelog kernel-smp-2.6.9-80.0.2.EL | more
* Sun Feb 08 2009 Marcus Alves Grando <marcus.grando.br> [TERRA VERSION]

-kernel: Enable REISERFS and XFS modules
-kernel: Change default kernel HZ to 250
-relatime: Relative atime updates (default: off)
-e1000: descriptor ring dump
-e1000: msi test and switch to intx
-bonding: keep all traffic when inactive device in promiscuous mode
-bnx2: fixup poll_controller routine
-bnx2: enable netdump again
-bnx2: fixup needed to allow netdump operation to complete
-bnx2: enable netdump again

* Fri Jan 23 2009 Vivek Goyal <vgoyal> [2.6.9-80]
...

Is there anything I can test?

Regards

Comment 1 Marcus Alves Grando 2009-02-09 13:34:00 UTC
Andy,

With these patches applied, it works worse:

-bnx2: fixup poll_controller routine
-bnx2: enable netdump again
-bnx2: fixup needed to allow netdump operation to complete
-bnx2: enable netdump again

The NIC resets randomly:

NETDEV WATCHDOG: eth1: transmit timed out
bnx2: eth1 NIC Copper Link is Down
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth1: transmit timed out
bnx2: eth1 NIC Copper Link is Down
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth1: transmit timed out
bnx2: eth1 NIC Copper Link is Down
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth0: transmit timed out
bnx2: eth0 NIC Copper Link is Down
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth0: transmit timed out
bnx2: eth0 NIC Copper Link is Down
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex

Regards

Comment 2 Andy Gospodarek 2009-02-09 15:14:20 UTC
Marcus, which version was the last one that worked well?

Comment 3 Andy Gospodarek 2009-02-09 15:16:53 UTC
(Sorry to not ask these questions in one comment.)

Marcus, can you send me the output from ethtool -S eth1 and eth0?  I'd like to understand why the frames are being dropped and the ethtool output might give us more information.  I would like to see if 'rx_fw_discards' is incrementing.
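
One way to check whether the counter is incrementing is to diff successive samples. The following is a minimal sketch, assuming the 'ethtool -S' text layout shown later in this report; the function name fw_discard_deltas is made up for illustration.

```sh
# Read one or more concatenated `ethtool -S <iface>` dumps on stdin and
# print how much rx_fw_discards grew between consecutive samples.
# (Hypothetical helper; feed it samples from a single interface.)
fw_discard_deltas() {
    awk '$1 == "rx_fw_discards:" {
        if (seen) print $2 - prev
        prev = $2
        seen = 1
    }'
}

# Example use (three samples, 10 seconds apart):
#   for i in 1 2 3; do ethtool -S eth1; sleep 10; done | fw_discard_deltas
```

Any non-zero line in the output means the firmware discarded frames during that interval.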

Comment 4 Marcus Alves Grando 2009-02-09 16:41:48 UTC
(In reply to comment #3)
> (Sorry to not ask these questions in one comment.)
> 
> Marcus, can you send me the output from ethtool -S eth1 and eth0?  I'd like to
> understand why the frames are being dropped and the ethtool output might give
> us more information.  I would like to see if 'rx_fw_discards' is incrementing.

Yes, sure.

[16:39:56] root@mail-fe05(temora):~# while true; do ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc; echo "---------"; sleep 10; done
     rx_discards: 0
     rx_fw_discards: 1875
     rx_discards: 0
     rx_fw_discards: 4368
---------
     rx_discards: 0
     rx_fw_discards: 1875
     rx_discards: 0
     rx_fw_discards: 4412
---------
     rx_discards: 0
     rx_fw_discards: 1875
     rx_discards: 0
     rx_fw_discards: 4503

[16:40:59] root@mail-fe05(temora):~# ethtool -S eth0
NIC statistics:
     rx_bytes: 1746844574
     rx_error_bytes: 0
     tx_bytes: 3297469196
     tx_error_bytes: 0
     rx_ucast_packets: 2909104
     rx_mcast_packets: 0
     rx_bcast_packets: 1486
     tx_ucast_packets: 3655832
     tx_mcast_packets: 0
     tx_bcast_packets: 0
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 1228957
     rx_65_to_127_byte_packets: 418654
     rx_128_to_255_byte_packets: 114947
     rx_256_to_511_byte_packets: 5520
     rx_512_to_1023_byte_packets: 137437
     rx_1024_to_1522_byte_packets: 1005075
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 959754
     tx_65_to_127_byte_packets: 477018
     tx_128_to_255_byte_packets: 37313
     tx_256_to_511_byte_packets: 36872
     tx_512_to_1023_byte_packets: 78072
     tx_1024_to_1522_byte_packets: 2066803
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 0
     tx_xoff_frames: 0
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 27327
     rx_discards: 0
     rx_fw_discards: 2278

[16:41:02] root@mail-fe05(temora):~# ethtool -S eth1
NIC statistics:
     rx_bytes: 133768241
     rx_error_bytes: 0
     tx_bytes: 17685864
     tx_error_bytes: 0
     rx_ucast_packets: 137602
     rx_mcast_packets: 1
     rx_bcast_packets: 84
     tx_ucast_packets: 120623
     tx_mcast_packets: 0
     tx_bcast_packets: 0
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 6863
     rx_65_to_127_byte_packets: 2033
     rx_128_to_255_byte_packets: 25042
     rx_256_to_511_byte_packets: 17473
     rx_512_to_1023_byte_packets: 4543
     rx_1024_to_1522_byte_packets: 81733
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 47359
     tx_65_to_127_byte_packets: 14982
     tx_128_to_255_byte_packets: 54104
     tx_256_to_511_byte_packets: 2921
     tx_512_to_1023_byte_packets: 191
     tx_1024_to_1522_byte_packets: 1066
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 0
     tx_xoff_frames: 0
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 50
     rx_discards: 0
     rx_fw_discards: 90

Regards

Comment 5 Marcus Alves Grando 2009-02-09 16:46:32 UTC
(In reply to comment #2)
> Marcus, which version was the last one that worked well?

1.6.9 does not work either. I'll try to find the last version that worked.

[16:44:33] root@mail-fe01(beleterro):~# while true; do ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc; echo "---------"; sleep 10; done
     rx_discards: 0
     rx_fw_discards: 7553697
     rx_discards: 0
     rx_fw_discards: 823645
---------
     rx_discards: 0
     rx_fw_discards: 7556198
     rx_discards: 0
     rx_fw_discards: 823788
---------
     rx_discards: 0
     rx_fw_discards: 7556320
     rx_discards: 0
     rx_fw_discards: 823788
[16:45:11] root@mail-fe01(beleterro):~# ethtool -i eth1 
driver: bnx2
version: 1.6.9
firmware-version: 1.9.6
bus-info: 0000:07:05.0

Comment 6 Marcus Alves Grando 2009-02-09 17:22:21 UTC
(In reply to comment #2)
> Marcus, which version was the last one that worked well?

Andy,

I've tried 2.6.9-67.0.22 and it works well. It is still a little bit strange, but much better than >2.6.9-78.

[17:17:55] root@mail-fe05(temora):~# ethtool -i eth0
driver: bnx2
version: 1.5.11-rh
firmware-version: 1.9.6
bus-info: 0000:06:04.0

[17:10:29] root@mail-fe05(temora):~# while true; do ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc; ifconfig | grep dropp; echo "-----------"; sleep 10; done
     rx_discards: 0
     rx_fw_discards: 5537
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:8225543 errors:0 dropped:5537 overruns:0 frame:0
          TX packets:11102060 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:16763930 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14512118 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:4591321 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4591321 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 5537
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:8302261 errors:0 dropped:5537 overruns:0 frame:0
          TX packets:11197698 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:16873455 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14606980 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:4623614 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4623614 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 5840
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:8377853 errors:0 dropped:5840 overruns:0 frame:0
          TX packets:11305200 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:17009128 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14726534 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:4658835 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4658835 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 5840
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:8433314 errors:0 dropped:5840 overruns:0 frame:0
          TX packets:11378551 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:17113454 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14816876 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:4697507 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4697507 errors:0 dropped:0 overruns:0 carrier:0

Maybe those dropped packets on eth0 are caused by something between client and server?

Regards

Comment 7 Andy Gospodarek 2009-02-09 18:02:58 UTC
It looks like 'rx_fw_discards' number matches the number of 'dropped' in ifconfig.  This is what I was expecting.
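
That comparison can be scripted. This is a rough sketch, assuming the older net-tools 'ifconfig' layout used in this report ("RX packets:... dropped:N") and the 'ethtool -S' layout shown above; the helper names are hypothetical and each helper expects output for a single interface.

```sh
# Print rx_fw_discards from `ethtool -S <iface>` output on stdin.
fw_discards() {
    awk '$1 == "rx_fw_discards:" { print $2 }'
}

# Print the RX "dropped:" count from old-style `ifconfig <iface>` output on stdin.
rx_dropped() {
    awk '$1 == "RX" && $2 ~ /^packets:/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^dropped:/) { sub(/^dropped:/, "", $i); print $i }
    }'
}

# Succeed only if both counters were found and are equal.
# $1 = ethtool -S text, $2 = ifconfig text
same_drops() {
    fw=$(printf '%s\n' "$1" | fw_discards)
    dr=$(printf '%s\n' "$2" | rx_dropped)
    [ -n "$fw" ] && [ "$fw" = "$dr" ]
}

# Example use:
#   same_drops "$(ethtool -S eth0)" "$(ifconfig eth0)" && echo "all drops are firmware discards"
```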

When the firmware is dropping the frames it is often because large amounts of data are coming into the box and:

- the driver wants to put more frames in the ring buffer than there are available slots

OR

- interrupts are not being serviced fast enough and frames are being dropped  

My first suggestion would be to bump up the size of the ring buffer.  First check the ring buffer setting with 'ethtool -g eth0' and pick a size larger than what is used right now (use ethtool -G eth0 rx [num] to set it).  You could try setting it to the maximum allowed if you want, but the larger it is, the more memory will be consumed by each interface.
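
As a hedged illustration of that procedure, the sketch below reads 'ethtool -g' output and prints the 'ethtool -G' command that would raise the RX ring to the reported maximum. It assumes the "Pre-set maximums" / "Current hardware settings" layout that ethtool prints for this driver; the function names are hypothetical, and nothing is changed on the system.

```sh
# Print "<max RX> <current RX>" parsed from `ethtool -g <iface>` output on stdin.
rx_ring_sizes() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        $1 == "RX:" && section != ""  { val[section] = $2 }
        END { if (("max" in val) && ("cur" in val)) print val["max"], val["cur"] }
    '
}

# suggest_rx_ring IFACE  (reads `ethtool -g IFACE` output on stdin)
# Prints the command to run; it does not run it.
suggest_rx_ring() {
    iface=$1
    sizes=$(rx_ring_sizes)
    [ -n "$sizes" ] || return 1
    max=${sizes% *}
    cur=${sizes#* }
    if [ "$cur" -lt "$max" ]; then
        echo "ethtool -G $iface rx $max"
    else
        echo "# $iface: RX ring already at maximum ($max)"
    fi
}

# Example use:
#   ethtool -g eth0 | suggest_rx_ring eth0
```

Going straight to the maximum is the simplest choice; a smaller intermediate value uses less memory per interface.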

If you are still seeing 'rx_fw_discards' then you can modify the coalesce settings since it seems likely that bursts of traffic might be causing frames to be dropped before the interrupt can even service them.  You can view and change the coalesce settings with 'ethtool -c eth0' and 'ethtool -C eth0 [new values]' if needed.

If you are willing, I would try with the latest driver and increase the rx ring buffer size to see if that helps.

Comment 8 Marcus Alves Grando 2009-02-09 19:03:21 UTC
(In reply to comment #7)

Andy,

I tried increasing the ring buffer and got different behavior.

With bnx2 1.6.9 it does not work:

[18:48:13] root@mail-fe01(beleterro):~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		1020
RX Mini:	0
RX Jumbo:	0
TX:		255
Current hardware settings:
RX:		1020
RX Mini:	0
RX Jumbo:	0
TX:		255
[18:48:17] root@mail-fe01(beleterro):~# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:		1020
RX Mini:	0
RX Jumbo:	0
TX:		255
Current hardware settings:
RX:		1020
RX Mini:	0
RX Jumbo:	0
TX:		255
[18:48:44] root@mail-fe01(beleterro):~# while true; do ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc; ifconfig | grep dropp; echo "-----------"; sleep 30; done
     rx_discards: 0
     rx_fw_discards: 11034
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:6057474 errors:0 dropped:11034 overruns:0 frame:0
          TX packets:7862766 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:9741815 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8255022 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:1942205347 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1942205347 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 11946
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:6312287 errors:0 dropped:11946 overruns:0 frame:0
          TX packets:8190477 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:10070560 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8532193 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:1942287609 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1942287609 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 12012
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:6548230 errors:0 dropped:12012 overruns:0 frame:0
          TX packets:8484342 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:10421532 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8832216 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:1942374543 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1942374543 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 12657
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:6794666 errors:0 dropped:12657 overruns:0 frame:0
          TX packets:8826395 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:10825648 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9167128 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:1942466484 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1942466484 errors:0 dropped:0 overruns:0 carrier:0

With bnx2 1.7.9 plus your netdump patches it works fine until the NIC resets:

[18:54:35] root@mail-fe05(temora):~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		1020
RX Mini:	0
RX Jumbo:	4080
TX:		255
Current hardware settings:
RX:		1020
RX Mini:	0
RX Jumbo:	0
TX:		255
[18:54:38] root@mail-fe05(temora):~# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:		1020
RX Mini:	0
RX Jumbo:	4080
TX:		255
Current hardware settings:
RX:		1020
RX Mini:	0
RX Jumbo:	0
TX:		255
[18:54:39] root@mail-fe05(temora):~# while true; do ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc; ifconfig | grep dropp; echo "-----------"; sleep 30; done
     rx_discards: 0
     rx_fw_discards: 0
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:1180452 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1614812 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:10638187 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9327734 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:3015973 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3015973 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 0
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:1402519 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1904640 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:11024476 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9667036 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:3129453 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3129453 errors:0 dropped:0 overruns:0 carrier:0
-----------
     rx_discards: 0
     rx_fw_discards: 0
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:1623194 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2209672 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:11390446 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9988863 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:3238349 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3238349 errors:0 dropped:0 overruns:0 carrier:0

After about 15 minutes without any errors, the NIC starts resetting; see below:

bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth0: transmit timed out
bnx2: eth0 NIC Copper Link is Down
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth0: transmit timed out
bnx2: eth0 NIC Copper Link is Down
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth0: transmit timed out
bnx2: eth0 NIC Copper Link is Down
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
NETDEV WATCHDOG: eth0: transmit timed out
bnx2: eth0 NIC Copper Link is Down
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex

* Default coalesce settings:

[18:59:48] root@mail-fe05(temora):~# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 18
rx-frames: 6
rx-usecs-irq: 18
rx-frames-irq: 6

tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 80
tx-frames-irq: 20

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

What do you suggest to change?

Regards

Comment 9 Andy Gospodarek 2009-02-09 22:33:00 UTC
Marcus, I'll look at the transmit timeouts and try to figure out why it's not coming back.

Comment 10 Marcus Alves Grando 2009-02-10 15:11:14 UTC
Andy, another point is:

I have other servers with Intel NICs, and they work fine with the same traffic. Why is the bnx2 driver not optimized for high traffic like e1000? Why is the ring buffer so small?

Regards

Comment 11 Andy Gospodarek 2009-02-10 17:31:00 UTC
(In reply to comment #10)
> Andy, another point is:
> 
> I have other servers with Intel NICs and works fine with same traffic. Why bnx2
> driver is not optimized for high traffic like e1000? Why ring buffer is too
> low?
> 
> Regards

Many of the kernel developers (myself included) feel like we should not waste kernel memory for things like driver ring buffers when many users will never need them.  That is the main reason why the number of bnx2 ring buffer entries is so small.

Comment 12 Andy Gospodarek 2009-02-10 20:15:43 UTC
(In reply to comment #8)
> 
> After +-15min without any error, nic reseting, see below:
> 
> bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
> bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
> NETDEV WATCHDOG: eth0: transmit timed out
> bnx2: eth0 NIC Copper Link is Down
> bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
> NETDEV WATCHDOG: eth0: transmit timed out
> bnx2: eth0 NIC Copper Link is Down
> bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
> NETDEV WATCHDOG: eth0: transmit timed out
> bnx2: eth0 NIC Copper Link is Down
> bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
> NETDEV WATCHDOG: eth0: transmit timed out
> bnx2: eth0 NIC Copper Link is Down
> bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
> 

Marcus, are there any logs before these messages that look different?  I find it strange that you are suddenly getting tx timeouts.  Does this box transmit a lot of frames as well as receive them?  Specifically I wonder if you see any messages like this:

BUG! Tx ring full when queue awake!

Any other messages are obviously helpful too.

I don't like to blindly recommend increasing the tx ring buffers too, but it might be worth a try if this box often sends as much data as it receives.

Comment 13 Andy Gospodarek 2009-02-10 20:16:38 UTC
Adding upstream maintainer in case Michael has any thoughts on this.

Comment 14 Marcus Alves Grando 2009-02-10 20:35:18 UTC
(In reply to comment #12)
> Marcus, are there any logs before these messages that look different?

No. Only these. Before that only boot messages.

> I find
> it strange that you are suddenly getting tx timeouts.  Does this box transmit a
> lot of frames as well as receive them?  Specifically I wonder if you see any
> messages like this:
> 
> BUG! Tx ring full when queue awake!

No. I've never seen this message on any of my servers.

> 
> Any other messages are obviously helpful too.
> 
> I don't like to blindly recommend increasing the tx ring buffers too, but it
> might be worth a try if this box often sends as much data as it receives.

These servers are mail servers: IMAP/POP/SMTP with NFS mounts. Nothing special.

eth0 (user interface) usually has 80Mb/s OUT and 30MB/s IN and eth1 (nfs interface) has 100Mb/s IN and 15Mb/s OUT.

There are no additional logs.

Any other ideas?

Comment 15 Michael Chan 2009-02-10 20:51:51 UTC
Increasing the rx ring size to 1020 usually will be enough to prevent or reduce
the number of dropped packets.  But sometimes, 1020 may not be big enough
still.  DaveM would not allow me to increase the max beyond 1020 several years
ago.

I don't know why some versions will continue to drop with ring size 1020, and
some versions will not drop anymore.  Is the traffic pattern the same when
trying different versions?  If it continues to drop, we can experiment by
increasing the ring beyond 1020.  Just change the MAX_RX_RINGS from 4 to 8, and
MAX_RX_PG_RINGS from 16 to 32 in bnx2.h.
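
For a rough idea of what that change would allow, here is a back-of-the-envelope sketch. It assumes the ethtool maximums scale linearly with the number of ring pages, and that the RX Jumbo maximum is the one governed by MAX_RX_PG_RINGS. The maximums reported earlier in this bug (RX 1020, RX Jumbo 4080) are consistent with 255 entries per ring for 4 and 16 rings, but that does not prove the relationship. scaled_ring_max is a hypothetical helper, not part of the driver.

```sh
# scaled_ring_max CUR_MAX CUR_RINGS NEW_RINGS
# Assumes: maximum descriptors = per-ring count * number of rings.
scaled_ring_max() {
    per_ring=$(( $1 / $2 ))
    echo $(( per_ring * $3 ))
}

scaled_ring_max 1020 4 8      # RX max if MAX_RX_RINGS goes 4 -> 8: prints 2040
scaled_ring_max 4080 16 32    # RX Jumbo max if MAX_RX_PG_RINGS goes 16 -> 32: prints 8160
```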

I don't know why we get transmit timeouts.  We haven't seen those for a long
time in our lab.  One possibility is that the bnx2 NIC is receiving a ton of
flow control packets, preventing it from sending out any packets and causing
the timeout.  But this happens very rarely.

Comment 16 Marcus Alves Grando 2009-02-11 01:18:36 UTC
Guys,

I tested some scenarios and results are:

kernel 2.6.9-78.0.1 bnx2 1.6.9 with ring size 1020: dropped packets often.
kernel 2.6.9-78.0.13 bnx2 1.7.9-1 without ring size 1020: dropped packets often.
kernel 2.6.9-78.0.13 bnx2 1.7.9-1 with ring size 1020: worked fine.
kernel 2.6.9-78.0.13 bnx2 1.7.9-1 with ring size 1020 and netdump patches[1]: did not drop packets, but transmit timed out

[1]:
http://people.redhat.com/agospoda/rhel4/0069-bnx2-fixup-poll_controller-routine.patch
http://people.redhat.com/agospoda/rhel4/0072-bnx2-enable-netdump-again.patch
http://people.redhat.com/agospoda/rhel4/0073-bnx2-fixup-needed-to-allow-netdump-operation-to-com.patch
http://people.redhat.com/agospoda/rhel4/0074-bnx2-enable-netdump-again.patch

I can test Michael's tip, but why did bnx2 1.5.11 work much better than 1.[6-7].9? It did not drop packets on the NFS interface, and on the front-end network the number of dropped packets was low.

The transmit timeouts seem to be caused by something in the netdump patches, don't they, Andy?

Regards

Comment 17 Andy Gospodarek 2009-02-11 02:35:40 UTC
Marcus, thank you so much for the detailed testing and analysis!  I will take a look at the netdump patches and see if I can understand what might be happening.

Comment 18 Andy Gospodarek 2009-02-11 03:21:24 UTC
(In reply to comment #16)
> Guys,
> 
> 
> kernel 2.6.9-78.0.13 bnx2 1.7.9-1 with ring size 1020: worked fine.
> kernel 2.6.9-78.0.13 bnx2 1.7.9-1 with ring size 1020 and netdump patches[1]:
> does not dropped packets but transmited timed out
> 

Since adding these patches caused problems, I wanted to break down which ones could be the problem, with a short description of each:

> [1]:
> http://people.redhat.com/agospoda/rhel4/0069-bnx2-fixup-poll_controller-routine.patch

This patch was only used when doing netconsole or netdump, so I do not think it is affecting you.


> http://people.redhat.com/agospoda/rhel4/0072-bnx2-enable-netdump-again.patch

This patch was the first one that could be causing problems.  It was the first attempt to poll each dummy_netdev contained in each napi instance individually.  The dummy_netdevs were added so we could do msi-x, but we needed to unify them under the regular netdev and use that for polling so netdump would work.

> http://people.redhat.com/agospoda/rhel4/0073-bnx2-fixup-needed-to-allow-netdump-operation-to-com.patch

First attempt to fixup netdump problems found with previous patch.

> http://people.redhat.com/agospoda/rhel4/0074-bnx2-enable-netdump-again.patch

Final changes that seemed to make netdump behave as I expected.



Marcus, can you paste the contents of /proc/interrupts when running any version of 1.7.9-1?  I presume you are using MSI?

Comment 19 Andy Gospodarek 2009-02-11 04:34:39 UTC
I was briefly able to reproduce ONE TX timeout on my system, so I think that's progress.  I'm not exactly sure how I did it: I ran netperf for a while, and when it stopped working I started pinging and doing all sorts of other network tests, and I finally made it die.  I'm not sure if I can reproduce it again, but I've made a small change (part of a patch from upstream), and I'm going to let it run while I sleep.

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 22580ab..364294d 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -3195,14 +3195,15 @@ static int bnx2_poll(struct net_device *dev, int *budget)
 
                work_done = bnx2_poll_work(bp, bnapi, work_done, *budget);
 
-               if (unlikely(work_done >= *budget))
-                       break;
-
                /* bnapi->last_status_idx is used below to tell the hw how
                 * much work has been processed, so we must read it before
                 * checking for more work.
                 */
                bnapi->last_status_idx = sblk->status_idx;
+
+               if (unlikely(work_done >= *budget))
+                       break;
+
                rmb();
                if (likely(!bnx2_has_work(bnapi))) {
                        if (likely(bp->flags & BNX2_FLAG_USING_MSI_OR_MSIX)) {

Comment 20 Andy Gospodarek 2009-02-11 04:48:07 UTC
That patch didn't do much -- I just managed to hang the network interface again. I'll do more testing and inspection tomorrow.

Comment 21 Marcus Alves Grando 2009-02-11 12:32:15 UTC
(In reply to comment #18)
> Marcus, can you paste the contents of /proc/interrupts when running any version
> of 1.7.9-1?  I presume you are using MSI?

root@mail-fe05(temora):~# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:     285342     282634     125053     122559    2506162    2506190    2506205    2513340    IO-APIC-edge  timer
  1:          0          0          0          0          0          0          0          9    IO-APIC-edge  i8042
  8:          6          8          5          3         59         55         56         63    IO-APIC-edge  rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
 14:          0          1          0          6          0        408         60        412    IO-APIC-edge  ide0
 66:       5977          0          0          0    1914233      98700    1809029      94163       PCI-MSI-X  cciss0
169:          0          0          0          0          0          0          1         38   IO-APIC-level  ohci_hcd
177:          0          0          0          0          0          0          0         17   IO-APIC-level  ehci_hcd
193:  115293886          0          0          1          1          0          0       5763   IO-APIC-level  eth1
201:          0          0          0          0          0          1          0         78   IO-APIC-level  uhci_hcd
209:          0          0  121990831          0          0          0          0        197   IO-APIC-level  eth0
NMI:          0          0          0          0          0          0          0          0 
LOC:   10847365   10847365   10847364   10847363   10847361   10847361   10847360   10847359 
ERR:          1
MIS:          0


No, I'm not using MSI. My NICs do not appear to support MSI.
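
A quick way to read the interrupt type for each NIC out of that table, as a hedged sketch: it assumes the interface name is the last column and the interrupt type is the column before it, as in the dump above, and it skips shared IRQ lines that list several devices. irq_types is a hypothetical name.

```sh
# Print "<iface> <interrupt type>" for every ethN line of /proc/interrupts
# supplied on stdin.
irq_types() {
    awk '$NF ~ /^eth[0-9]+$/ { print $NF, $(NF - 1) }'
}

# Example use:
#   irq_types < /proc/interrupts
```

For the dump above this reports IO-APIC-level for both eth1 and eth0, while a device using MSI-X (cciss0 here) shows a PCI-MSI-X type instead.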

Regards

Comment 22 Andy Gospodarek 2009-02-11 20:11:33 UTC
Ok, I discovered what is wrong.  I should have a patch for you to test in a few minutes.

Comment 23 Andy Gospodarek 2009-02-11 22:49:23 UTC
Created attachment 331629
bnx2-fixup-problems-with-netpoll-implementation.patch

I tested this patch and everything seems to work well.  Netdump still works and I was able to run netperf for about 2 hours without any issues.  Please apply it on top of the other patches you have already applied.

Comment 24 Andy Gospodarek 2009-02-12 12:35:11 UTC
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel4

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.

Comment 26 Marcus Alves Grando 2009-02-12 14:39:30 UTC
(In reply to comment #23)
> Created an attachment (id=331629) [details]
> bnx2-fixup-problems-with-netpoll-implementation.patch
> 
> I tested this patch and everything seems to work well.  Netdump still works and
> I was able to run netperf for about 2 hours without any issues.  Please apply it
> on top of the other patches you have already applied.

Andy,

I tested it and it has worked fine for at least 3 hours. But I tested with the large ring size. I'll test with the default ring size now.

     rx_discards: 0
     rx_fw_discards: 1871
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:103321728 errors:0 dropped:1871 overruns:0 frame:0
          TX packets:140809181 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:174826927 errors:0 dropped:0 overruns:0 frame:0
          TX packets:152450304 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:47868681 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47868681 errors:0 dropped:0 overruns:0 carrier:0

Regards

Comment 27 Marcus Alves Grando 2009-02-12 17:51:03 UTC
Andy, I tested again without the ring changes, and it works better than 1.7.9 without your patches. The eth1 interface no longer drops any packets, and eth0 (the users' interface) still drops some, but fewer than before. With your patches it works as well as 1.5.11.

# ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc
     rx_discards: 0
     rx_fw_discards: 30871
     rx_discards: 0
     rx_fw_discards: 0

# ifconfig | grep dropp
          RX packets:101959202 errors:0 dropped:30871 overruns:0 frame:0
          TX packets:138883699 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:175810912 errors:0 dropped:0 overruns:0 frame:0
          TX packets:153077867 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:49959776 errors:0 dropped:0 overruns:0 frame:0
          TX packets:49959776 errors:0 dropped:0 overruns:0 carrier:0

# dmesg | tail -n 3
ip_conntrack version 2.1 (32768 buckets, 262144 max) - 348 bytes per conntrack
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex

# uptime
 17:47:59 up  3:10,  1 user,  load average: 35.79, 34.05, 33.77
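
For anyone tracking these counters over time, here is a minimal sketch of pulling the discard counters out of `ethtool -S` output. The helper name `parse_ethtool_stats` is hypothetical; the sample text and the RX packet count are the eth0 values from this comment.

```python
import re

def parse_ethtool_stats(text):
    """Return {counter_name: int} from `ethtool -S` style output."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     rx_fw_discards: 30871".
        m = re.match(r"\s*(\w+):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

# eth0 counters as reported in this comment.
sample = """\
     rx_discards: 0
     rx_fw_discards: 30871
"""

stats = parse_ethtool_stats(sample)
fw_drops = stats["rx_fw_discards"]
rx_packets = 101959202  # eth0 "RX packets" from the ifconfig output above
print(f"rx_fw_discards={fw_drops} ({fw_drops / rx_packets:.4%} of RX packets)")
```

Header lines such as "NIC statistics:" carry no number and are skipped by the pattern.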

Comment 28 Andy Gospodarek 2009-02-13 22:13:48 UTC
Created attachment 331879 [details]
bnx2-revert-multiqueue-support-and-fix-netdump.patch

Marcus, I took a look at those huge patches I did, and decided I would rather not support multiqueue receive on RHEL4 than risk all that change.  I realize you have done quite a bit of testing for me, but if you can test one more patch I would really appreciate it.  

This one can be applied directly on top of the -80 kernel.  No other patches from my test kernels are needed.  Thanks!

Comment 29 RHEL Program Management 2009-02-13 22:18:36 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 30 Marcus Alves Grando 2009-02-16 13:51:41 UTC
(In reply to comment #28)
> Created an attachment (id=331879) [details]
> bnx2-revert-multiqueue-support-and-fix-netdump.patch
> 
> Marcus, I took a look at those huge patches I did, and decided I would rather
> not support multiqueue receive on RHEL4 than risk all that change.  I realize
> you have done quite a bit of testing for me, but if you can test one more patch
> I would really appreciate it.  
> 
> This one can be applied directly on top of the -80 kernel.  No other patches
> from my test kernels are needed.  Thanks!

Andy, on the first try the kernel did not boot. I'll check the crash dump and send it to you.

Regards

Comment 31 Andy Gospodarek 2009-02-16 15:51:07 UTC
Marcus, sorry to hear that.  I will try and reproduce that here too.

Comment 32 Marcus Alves Grando 2009-02-16 16:16:25 UTC
Info from crash:

please wait... (gathering module symbol data)   
WARNING: cannot access vmalloc'd module memory

      KERNEL: /usr/lib/debug/lib/modules/2.6.9-81.0.1.ELsmp/vmlinux
    DUMPFILE: vmcore-incomplete  [PARTIAL DUMP]
        CPUS: 8
        DATE: Mon Feb 16 13:25:37 2009
      UPTIME: 00:03:50
LOAD AVERAGE: 0.20, 0.05, 0.01
       TASKS: 88
    NODENAME: temora.hst.terra.com.br
     RELEASE: 2.6.9-81.0.1.ELsmp
     VERSION: #1 SMP Sun Feb 15 00:20:41 UTC 2009
     MACHINE: i686  (2612 Mhz)
      MEMORY: 18 GB
       PANIC: "Oops: 0002 [#1]" (check log for details)
         PID: 3584
     COMMAND: "ip"
        TASK: f63e7930  [THREAD_INFO: f7acc000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

crash> log
[...]
disk_dump: total blocks required: 4193920 (header 3 + bitmap 144 + memory 4193773)
ip_tables: (C) 2000-2002 Netfilter core team
ip_conntrack version 2.1 (32768 buckets, 262144 max) - 348 bytes per conntrack
Unable to handle kernel NULL pointer dereference at virtual address 00000024
 printing eip:
f88f63ba
*pde = 0a383001
Oops: 0002 [#1]
SMP 
Modules linked in: iptable_filter ipt_REDIRECT iptable_nat ip_conntrack ip_tables ide_dump cciss_dump scsi_dump diskdump zlib_deflate xfs joydev dm_mirror dm_mod button battery ac ohci_hcd ehci_hcd uhci_hcd k8_edac edac_mc bnx2 floppy ext3 jbd cciss sd_mod scsi_mod
CPU:    2
EIP:    0060:[<f88f63ba>]    Not tainted VLI
EFLAGS: 00010297   (2.6.9-81.0.1.ELsmp) 
EIP is at bnx2_napi_enable+0x15/0x29 [bnx2]
eax: 00000000   ebx: 00000000   ecx: f6919240   edx: f69192d0
esi: f6919240   edi: f6919000   ebp: 00000000   esp: f7acced4
ds: 007b   es: 007b   ss: 0068
Process ip (pid: 3584, threadinfo=f7acc000 task=f63e7930)
Stack: f6919000 f88fcce6 f6919000 00000000 00001002 00000000 c0288515 f6919000 
       00001003 c0289988 00000000 ffffff9d f7accf38 00000000 c02c1f02 00000000 
       00000000 f6919000 00000000 bfe8f840 00008914 08041003 bfe8f8c4 00e76638 
Call Trace:
 [<f88fcce6>] bnx2_open+0x52/0x154 [bnx2]
 [<c0288515>] dev_open+0x2e/0x6d
 [<c0289988>] dev_change_flags+0x48/0xed
 [<c02c1f02>] devinet_ioctl+0x2b2/0x61d
 [<c02c3aff>] inet_ioctl+0x79/0xa5
 [<c0280801>] sock_ioctl+0x28c/0x2b4
 [<c016db72>] sys_ioctl+0x227/0x269
 [<c02ddb2f>] syscall_call+0x7/0xb
Code: 5d 9e c7 eb d9 45 83 c3 28 3b ae 6c 04 00 00 7c cb 5b 5e 5f 5d c3 53 31 db 89 c1 3b 98 6c 04 00 00 7d 1a 8d 90 90 00 00 00 8b 02 <f0> 0f ba 70 24 05 43 83 c2 28 3b 99 6c 04 00 00 7c ec 5b c3 57

crash> bt
PID: 3584   TASK: f63e7930  CPU: 2   COMMAND: "ip"
 #0 [f7accd84] die at c010604e
 #1 [f7accdb4] do_page_fault at c011bad4
 #2 [f7acceec] dev_open at c0288513
 #3 [f7accef8] dev_change_flags at c0289986
 #4 [f7accf0c] devinet_ioctl at c02c1efd
 #5 [f7accf68] inet_ioctl at c02c3afa
 #6 [f7accf7c] sock_ioctl at c02807fe
 #7 [f7accf94] sys_ioctl at c016db6f
 #8 [f7accfc0] system_call at c02ddb28
    EAX: 00000036  EBX: 00000003  ECX: 00008914  EDX: bfe8f840 
    DS:  007b      ESI: 00000003  ES:  007b      EDI: bfe8f840
    SS:  007b      ESP: bfe8f7e8  EBP: bfe8f958
    CS:  0073      EIP: 00206834  ERR: 00000036  EFLAGS: 00000206

Anything else?

Regards
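
As a reading aid for the oops above, here is a minimal sketch that decodes the faulting instruction bytes (the ones starting at `<f0>` in the "Code:" line) and the "Oops: 0002" error code. It handles only this one x86 encoding (the 0F BA bit-test group with a register plus 8-bit displacement operand) and is not a general disassembler; the function names are hypothetical.

```python
# Bytes starting at the faulting instruction, marked <f0> in the oops "Code:" line.
insn = bytes.fromhex("f0 0f ba 70 24 05")

REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
GROUP8 = {4: "bt", 5: "bts", 6: "btr", 7: "btc"}  # 0F BA /4../7 opcode extensions

def decode_group8(code):
    """Decode an optionally LOCK-prefixed 0F BA instruction of the form [reg + disp8]."""
    i = 0
    prefix = ""
    if code[i] == 0xF0:
        prefix = "lock "
        i += 1
    if code[i:i + 2] != b"\x0f\xba":
        raise ValueError("not a 0F BA (bit test group) instruction")
    modrm = code[i + 2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4 or reg not in GROUP8:
        raise ValueError("only the [reg + disp8] form is handled in this sketch")
    disp = int.from_bytes(code[i + 3:i + 4], "little", signed=True)
    imm = code[i + 4]
    return f"{prefix}{GROUP8[reg]} dword [{REGS32[rm]}{disp:+#x}], {imm}"

def decode_fault_code(err):
    """Split an x86 page-fault error code into its three low bits."""
    return {
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }

decoded = decode_group8(insn)
eax = 0x00000000  # from the register dump in the oops
print(decoded)
print(hex(eax + 0x24), decode_fault_code(0x0002))
```

The decoded instruction is a locked bit-clear at offset 0x24 from the pointer in eax. Since eax is zero in the register dump, the write lands at address 0x24, which matches the reported NULL pointer dereference inside bnx2_napi_enable.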

Comment 33 Andy Gospodarek 2009-02-16 17:06:24 UTC
Thanks, Marcus.  I think I found this problem.  Just testing with netperf and netdump with msi enabled and disabled and then I will post a new patch to obsolete the patch in comment #28.

Comment 34 Andy Gospodarek 2009-02-16 17:43:09 UTC
Created attachment 332069 [details]
bnx2-final-working-and-tested.patch

This patch applied to 2.6.9-80 should work.  I've been running netperf for about 30 minutes and I have been able to netdump with msi and legacy mode interrupts without problems.

Thanks again for testing these patches, Marcus!

Comment 35 Marcus Alves Grando 2009-02-16 19:31:35 UTC
(In reply to comment #34)
> Created an attachment (id=332069) [details]
> bnx2-final-working-and-tested.patch

Andy, this one works perfectly with a larger RX ring. Without touching the RX ring, both the NFS interface and the user interface drop packets too often.

With a larger RX ring, the interface drops packets only when traffic is close to 4x that of a normal server; otherwise it does not drop any packets.

------ Without high RX ring --------
[19:10:43] root@mail-fe05(temora):~# while true; do ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc; ifconfig | grep dropp; sleep 10; done
     rx_discards: 0
     rx_fw_discards: 1288
     rx_discards: 0
     rx_fw_discards: 229
          RX packets:3458894 errors:0 dropped:1288 overruns:0 frame:0
          TX packets:4670440 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:6672043 errors:0 dropped:229 overruns:0 frame:0
          TX packets:5863899 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:2006154 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2006154 errors:0 dropped:0 overruns:0 carrier:0
     rx_discards: 0
     rx_fw_discards: 1288
     rx_discards: 0
     rx_fw_discards: 346
          RX packets:3604345 errors:0 dropped:1288 overruns:0 frame:0
          TX packets:4878953 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:6960446 errors:0 dropped:346 overruns:0 frame:0
          TX packets:6113120 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:2074435 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2074435 errors:0 dropped:0 overruns:0 carrier:0
     rx_discards: 0
     rx_fw_discards: 1426
     rx_discards: 0
     rx_fw_discards: 352
          RX packets:3762686 errors:0 dropped:1426 overruns:0 frame:0
          TX packets:5104817 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:7215366 errors:0 dropped:352 overruns:0 frame:0
          TX packets:6336175 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:2141348 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2141348 errors:0 dropped:0 overruns:0 carrier:0
------ Without high RX ring --------

------ With high RX ring --------
[19:24:48] root@mail-fe05(temora):~# while true; do ethtool -S eth0 | grep disc; ethtool -S eth1 | grep disc; ifconfig | grep dropp; sleep 30; done
     rx_discards: 0
     rx_fw_discards: 649
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:12079733 errors:0 dropped:649 overruns:0 frame:0
          TX packets:16469995 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:21344168 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18720164 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:7777040 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7777040 errors:0 dropped:0 overruns:0 carrier:0
     rx_discards: 0
     rx_fw_discards: 649
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:12500338 errors:0 dropped:649 overruns:0 frame:0
          TX packets:17042594 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:22055818 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19346640 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:7979074 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7979074 errors:0 dropped:0 overruns:0 carrier:0
     rx_discards: 0
     rx_fw_discards: 649
     rx_discards: 0
     rx_fw_discards: 0
          RX packets:12909651 errors:0 dropped:649 overruns:0 frame:0
          TX packets:17587961 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:22724569 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19930921 errors:0 dropped:0 overruns:0 carrier:0
          RX packets:8176045 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8176045 errors:0 dropped:0 overruns:0 carrier:0
------ With high RX ring --------

If this is the final patch, would it be possible to bump the default RX ring size? Otherwise the default RX ring performs worse than the 1.5.11-rh bnx2 version.

Regards
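
The counters in the two loops above are cumulative, so the comparison is easier to see as per-interval increases. A minimal sketch follows; the `deltas` helper is hypothetical, and the readings are the rx_fw_discards values copied from the output above (eth0 first, then eth1, as in the loop).

```python
def deltas(samples):
    """Per-interval increase of a cumulative counter."""
    return [b - a for a, b in zip(samples, samples[1:])]

# rx_fw_discards readings copied from the two loops above.
default_ring = {"eth0": [1288, 1288, 1426], "eth1": [229, 346, 352]}  # 10 s apart
large_ring = {"eth0": [649, 649, 649], "eth1": [0, 0, 0]}  # 30 s apart

for label, runs in (("default ring", default_ring), ("large ring", large_ring)):
    for iface, samples in runs.items():
        print(label, iface, deltas(samples))
```

With the default ring both interfaces keep accumulating discards between samples, while with the larger ring neither counter moves during the sampled window.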

Comment 36 Andy Gospodarek 2009-02-17 03:07:48 UTC
When the traffic that is dropping frames is running, does it burst a large amount of data over a short period of time, or does it just send a continuous stream of traffic at a high rate?

Comment 37 Marcus Alves Grando 2009-02-17 16:59:42 UTC
(In reply to comment #36)
> When the traffic that is dropping frames is running, does it burst a large
> amount of data over a short period of time or does it just send a continuous
> amount of traffic at a high rate?

I don't know, but I'll collect some tcpdumps to verify.

Regards

Comment 38 Marcus Alves Grando 2009-02-19 14:21:56 UTC
Andy,

These samples were taken without ring changes, at about half the load of the other servers. I collected them every five seconds; some drops are visible below.

eth0
	RXbytes:10437437  RXpackets:19062  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:16397765  TXpackets:21448  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:23878763  RXpackets:30482  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:4360726  TXpackets:27378  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth0
	RXbytes:9810484  RXpackets:23023  RXerrs:0  RXdrop:551  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:39043646  TXpackets:36548  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:38111763  RXpackets:42258  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:5396406  TXpackets:37600  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth0
	RXbytes:8344257  RXpackets:23698  RXerrs:0  RXdrop:15  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:35800397  TXpackets:34272  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:37071245  RXpackets:41511  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:5262631  TXpackets:37639  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0

----------------

eth0
	RXbytes:10403610  RXpackets:19183  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:20079846  TXpackets:23413  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:24871627  RXpackets:30783  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:4258572  TXpackets:27103  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth0
	RXbytes:9026210  RXpackets:24906  RXerrs:0  RXdrop:197  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:53757271  TXpackets:45666  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:32971682  RXpackets:35901  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:4444185  TXpackets:30640  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0

Something else?

Regards

Comment 39 Marcus Alves Grando 2009-02-19 14:37:54 UTC
These samples were taken every second.

eth0
	RXbytes:1810340  RXpackets:3426  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:3036694  TXpackets:3936  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:3450924  RXpackets:4760  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:717496  TXpackets:4316  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth0
	RXbytes:2194973  RXpackets:7155  RXerrs:0  RXdrop:193  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:23052541  TXpackets:17168  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:3356591  RXpackets:4336  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:616822  TXpackets:3887  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth0
	RXbytes:1513523  RXpackets:3472  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:3394439  TXpackets:4311  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0  
eth1
	RXbytes:5009266  RXpackets:6344  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:869859  TXpackets:5564  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0
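
To make the burst question from comment #36 easier to judge, here is a minimal sketch that parses the eth0 RX lines from the one-second samples above and compares the packet count in the interval with drops against the quiet intervals. The `parse_counters` helper is hypothetical, and three samples are far too few to prove anything; this only shows how the comparison can be made.

```python
def parse_counters(line):
    """Turn 'RXbytes:1  RXpackets:2 ...' into {'RXbytes': 1, 'RXpackets': 2, ...}."""
    return {k: int(v) for k, v in (field.split(":") for field in line.split())}

# The three consecutive eth0 RX lines from the one-second samples above.
eth0_rx = [
    "RXbytes:1810340  RXpackets:3426  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0",
    "RXbytes:2194973  RXpackets:7155  RXerrs:0  RXdrop:193  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0",
    "RXbytes:1513523  RXpackets:3472  RXerrs:0  RXdrop:0  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0",
]

samples = [parse_counters(line) for line in eth0_rx]
for i, s in enumerate(samples):
    flag = "  <-- drops" if s["RXdrop"] else ""
    print(f"t={i}s packets={s['RXpackets']} drops={s['RXdrop']}{flag}")

# How much larger was the packet count in the interval that saw drops?
quiet = [s["RXpackets"] for s in samples if s["RXdrop"] == 0]
burst = max(s["RXpackets"] for s in samples if s["RXdrop"] > 0)
ratio = burst / (sum(quiet) / len(quiet))
print(f"burst interval carried {ratio:.2f}x the quiet-interval packet count")
```

In these samples the only interval with drops is also the one where the RX packet count roughly doubles, which is consistent with short bursts rather than a steady high rate.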

Comment 40 Andy Gospodarek 2009-02-20 14:30:15 UTC
Marcus, I'm not sure what to think about this.  These numbers are the same as the rx_fw_discards ethtool stat, right?

Was this a problem with 1.5.11 or 1.6.9?

Comment 42 Marcus Alves Grando 2009-02-26 01:17:00 UTC
(In reply to comment #40)
> Marcus, I'm not sure what to think about this.  These numbers are the same as
> the rx_fw_discards ethtool stat, right?

Yes.

> 
> Was this a problem with 1.5.11 or 1.6.9?

I'll check tomorrow.

Regards

Comment 45 Marcus Alves Grando 2009-02-26 21:59:32 UTC
(In reply to comment #40)
> Marcus, I'm not sure what to think about this.  These numbers are the same as
> the rx_fw_discards ethtool stat, right?
> 
> Was this a problem with 1.5.11 or 1.6.9?

It happens with 1.5.11 too.

# ethtool -i eth0
driver: bnx2
version: 1.5.11-rh
firmware-version: 1.9.6
bus-info: 0000:06:04.0

Collected with netstat at a 10-second interval:

eth0
	RXbytes:17812447  RXpackets:37704  RXerrs:0  RXdrop:95  RXfifo:0  RXframe:0  RXcompressed:0  RXmulticast:0  
	TXbytes:62352098  TXpackets:56126  TXerrs:0  TXdrop:0  TXfifo:0  TXcolls:0  TXcarrier:0  TXcompressed:0 

# ethtool -S eth0
NIC statistics:
     rx_bytes: 241903798
     rx_error_bytes: 0
     tx_bytes: 1069444900
     tx_error_bytes: 0
     rx_ucast_packets: 693609
     rx_mcast_packets: 0
     rx_bcast_packets: 59
     tx_ucast_packets: 1034147
     tx_mcast_packets: 0
     tx_bcast_packets: 5
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 423534
     rx_65_to_127_byte_packets: 119297
     rx_128_to_255_byte_packets: 8056
     rx_256_to_511_byte_packets: 2227
     rx_512_to_1023_byte_packets: 8902
     rx_1024_to_1522_byte_packets: 131652
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 172773
     tx_65_to_127_byte_packets: 103993
     tx_128_to_255_byte_packets: 18100
     tx_256_to_511_byte_packets: 33356
     tx_512_to_1023_byte_packets: 36079
     tx_1024_to_1522_byte_packets: 669851
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 0
     tx_xoff_frames: 0
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 1183
     rx_discards: 0
     rx_fw_discards: 95

I really don't know why packets are dropped with such low network usage...

Regards

Comment 46 Andy Gospodarek 2009-02-27 20:40:17 UTC
Even with low overall usage, if you have a burst of traffic (which is likely with a mailserver) you will need to make sure you can handle those bursts when they happen.  I would recommend increasing your ring buffer and leaving it higher.
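
A minimal sketch of applying that recommendation from a script, assuming the standard `ethtool -G <iface> rx <N>` form for resizing the RX ring. The function name `set_rx_ring` and the value 1020 are placeholders, not figures from this bug; `ethtool -g <iface>` reports the maximum the hardware actually allows.

```python
import subprocess

def set_rx_ring(iface, rx_entries, dry_run=True):
    """Build (and optionally run) the ethtool command that resizes the RX ring."""
    if rx_entries <= 0:
        raise ValueError("ring size must be positive")
    cmd = ["ethtool", "-G", iface, "rx", str(rx_entries)]
    if not dry_run:
        # Needs root and a driver that supports changing ring parameters.
        subprocess.run(cmd, check=True)
    return cmd

# 1020 is only a placeholder; check `ethtool -g eth0` for the real maximum.
print(" ".join(set_rx_ring("eth0", 1020)))
```

By default the sketch only prints the command. Ring sizes set this way are not persistent on their own, so they need to be reapplied after a reboot or a driver reload.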

Comment 47 Marcus Alves Grando 2009-02-27 21:11:06 UTC
(In reply to comment #46)
> Even with low overall usage, if you have a burst of traffic (which is likely
> with a mailserver) you will need to make sure you can handle those bursts when
> they happen.  I would recommend increasing your ring buffer and leaving it
> higher.

That's OK for me. So, will your last patch be committed? It works fine here.

Thanks a lot.

Comment 48 Andy Gospodarek 2009-02-27 21:19:21 UTC
Yes, the patch in comment #34 will be included in RHEL 4.8.

Thank you for all of your work testing and debugging these patches.

Comment 49 Neil Horman 2009-03-05 18:34:27 UTC
*** Bug 488749 has been marked as a duplicate of this bug. ***

Comment 50 Issue Tracker 2009-03-06 14:06:17 UTC
------- Comment From  2009-03-06 04:08 EDT-------
(In reply to comment #36)
> Hello IBM,
> Please let me know if you would be able to test the kernel from:
> http://people.redhat.com/agospoda/

yes, I tested the kernel-smp-2.6.9-81.EL.gtest.60.x86_64.rpm and it helped
in taking a dump over network. With this kernel, netdump works on the ls21
machine.
thanks

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by jkachuck 
 issue 268900

Comment 51 Vivek Goyal 2009-03-11 14:10:51 UTC
Committed in 83.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 52 Andy Gospodarek 2009-03-18 18:51:38 UTC
*** Bug 480693 has been marked as a duplicate of this bug. ***

Comment 55 Jan Tluka 2009-04-28 17:06:03 UTC
Patch is in -89.EL kernel.

Comment 58 errata-xmlrpc 2009-05-18 19:35:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

