Created attachment 451427 [details]
ethtool and /proc/interrupts captured every 5 seconds

Description of problem:
A system with 4 bnx2 NICs, 2 of which are bonded together, is experiencing periodic loss of received packets on some of the NICs. The packet loss can be seen in the ethtool statistics, with rx_fw_discards increasing by 200-600 packets when a loss event occurs.

Version-Release number of selected component (if applicable):
The RHEL5.4 system has been experiencing the issue with 5.4 kernels up to the latest 2.6.18-164.25.2.el5 kernel.

How reproducible:
The packet loss occurs regularly on the customer's system.

Additional info:
From the ethtool statistics, a high percentage of the received packets near the time of loss are small packets (<128 bytes in size). However, there are other periods in the capture which show even larger quantities of small packets received without issue. From the interrupt statistics, there did not appear to be any obvious conflict with other devices: CPUs with a heavier storage interrupt load did not match CPUs with a heavy NIC interrupt load.
Also, the receive ring buffers have been increased to their maximum allowed values and the loss still occurs.
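(For reference, a minimal sketch of the commands typically used for this, assuming eth0 as the interface; the other bonded slave would be handled the same way, and the ring maximum reported by ethtool depends on the driver and kernel in use.)

Check the current and maximum ring sizes:
# ethtool -g eth0

Raise the RX ring to the reported pre-set maximum (1020 is only an example value; use whatever maximum is reported above):
# ethtool -G eth0 rx 1020

Sample the firmware discard counter every 5 seconds:
# while true; do date; ethtool -S eth0 | grep rx_fw_discards; sleep 5; done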
Which type(s) of bnx2 NIC is experiencing the problem? Is flow control enabled on these NICs? Is there any spike in CPU usage that correlates with the dropped-packet spikes?
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
03:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

sos_commands/networking/ethtool_-k_eth0:
Cannot get device udp large send offload settings: Operation not supported
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off
The kernel being used would appear to have the fix for bz511368. Typically, rx_fw_discards increases either when the host system cannot keep up with the NIC (hence the question about CPU usage and the suggestion to increase the buffers), or when there is a problem internal to the NIC, as was the case in bz511368. Looking at the data, eth0 appears to have dropped about 2,000 packets over 51.5 minutes in 7 bursts, with a maximum of 600 per burst, so the drops are bursty rather than steady. What is the output from ethtool -a?
Just got sar data for when we last logged dropped packets...

CPU utilization (sar: %user %nice %system %iowait %steal %idle), interleaved with the rx_fw_discards samples:

13:40:01 all 12.74 0.00  3.18 0.08 0.00 84.00
13:50:01 all 38.12 0.00 13.61 0.13 0.00 48.14
Fri Oct 1 13:56:56 EDT 2010  rx_fw_discards: 40903
Fri Oct 1 13:57:26 EDT 2010  rx_fw_discards: 41206
14:00:01 all 11.88 0.00  4.36 0.15 0.00 83.60
14:10:01 all  9.73 0.00  2.89 0.06 0.00 87.32
Fri Oct 1 14:11:43 EDT 2010  rx_fw_discards: 41493
Fri Oct 1 14:15:33 EDT 2010  rx_fw_discards: 41867
Fri Oct 1 14:16:29 EDT 2010  rx_fw_discards: 42519
Fri Oct 1 14:17:09 EDT 2010  rx_fw_discards: 42849
14:20:01 all 13.82 0.00  3.17 0.09 0.00 82.93
Fri Oct 1 14:24:19 EDT 2010  rx_fw_discards: 42860
14:30:01 all 18.76 0.00  2.97 0.13 0.00 78.14
14:40:01 all 17.10 0.00  3.37 0.06 0.00 79.47
Fri Oct 1 14:45:22 EDT 2010  rx_fw_discards: 43018
14:50:01 all 19.79 0.00  5.32 0.10 0.00 74.80

00:00:01  INTR  intr/s
13:30:01  sum   5693.02
13:40:01  sum   8986.82
13:50:01  sum   7837.95
14:00:01  sum  13478.35
14:10:01  sum   6559.21
14:20:01  sum   7246.13
14:30:01  sum   6460.81
14:40:01  sum   7962.50
14:50:01  sum   6434.55
15:00:01  sum  10088.46
15:10:01  sum   7461.58

00:00:01  IFACE  rxpck/s   txpck/s  rxbyt/s      txbyt/s     rxcmp/s  txcmp/s  rxmcst/s
13:30:01  eth0   11095.87  2960.81  13441915.13  2059196.41  0.00     0.00     0.00
13:40:01  eth0   13383.73  5186.04  13306012.25  3359749.89  0.00     0.00     0.00
13:50:01  eth0   14287.57  3935.98  15925465.14  3236305.48  0.00     0.00     0.00
14:00:01  eth0   20476.03  8259.06  17472706.26  4099692.31  0.00     0.00     0.00
14:10:01  eth0   13600.74  3131.51  16334440.24  2249581.55  0.00     0.00     0.00
14:20:01  eth0   14411.35  4024.41  17666003.54  2778519.76  0.00     0.00     0.00
14:30:01  eth0   10039.46  3117.78  12279608.20  2616147.02  0.00     0.00     0.00
14:40:01  eth0    9594.85  3308.80   8515248.20  1679508.82  0.00     0.00     0.00
14:50:01  eth0   10681.58  3890.23  12524835.96  2879344.56  0.00     0.00     0.00
15:00:01  eth0   13907.47  6118.19  13097537.93  4214040.90  0.00     0.00     0.00
15:10:01  eth0   13417.40  4097.86  16354438.15  2861063.49  0.00     0.00     0.00
[root@icgdais10u ~]# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: off
TX: off

[root@icgdais10u ~]# ethtool -a eth1
Pause parameters for eth1:
Autonegotiate: on
RX: on
TX: on

[root@icgdais10u ~]# ethtool -a eth2
Pause parameters for eth2:
Autonegotiate: on
RX: off
TX: off

[root@icgdais10u ~]# ethtool -a eth3
Pause parameters for eth3:
Autonegotiate: on
RX: on
TX: on

[root@icgdais10u ~]# ethtool -a eth4
Pause parameters for eth4:
Autonegotiate: on
RX: on
TX: on

[root@icgdais10u ~]# ethtool -a eth5
Pause parameters for eth5:
Autonegotiate: on
RX: on
TX: on
Is the system in question having drops on all 4 bnx2 interfaces? The data provided with the first comment only has rx_fw_discards for the first set of NIC statistics, which I assumed to be eth0, so I can't tell about the other interfaces. Is there a reason why RX and TX flow control is disabled on eth0 and eth2? (see comment #6) Which NICs are associated with eth4 and eth5? (Just curious, because elsewhere it was stated that we were dealing with "a system with 4 bnx2 NICs".) Thanks for the data.
Hi John,

They are only experiencing the issue on their public bond0, which is comprised of eth0 and eth2. No idea why flow control is disabled; should we ask them to re-enable it? How do they do that? The other NICs are Intel e1000.

Thanks,
-Guil
I would like to try it. Basically, this entails enabling the use of PAUSE frames, which can be used to throttle the rate of transmission. If eth0 and eth2 are being overrun with data, the NIC has no choice but to drop frames; with PAUSE enabled, some degree of flow control can be established. So I would suggest implementing "ethtool -A ethX rx on" and then "tx on".
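(A minimal sketch of what that would look like, using eth0 and eth2, the bond0 slaves; note that the switch or link partner must also negotiate PAUSE for flow control to actually take effect.)

# ethtool -A eth0 rx on tx on
# ethtool -A eth2 rx on tx on

Verify afterwards with:
# ethtool -a eth0
# ethtool -a eth2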
A couple of quick questions: Since the system in question is running 5.4 (right?), could you provide the full output from ethtool -S so the firmware version can be checked? Are bnx2i and cnic loaded in addition to bnx2?
Created attachment 451979 [details]
ethtool default, -i, -k, and -S output for all NICs

I've attached a tar file which contains files with the ethtool output for all NICs captured by the sosreport. The default (no options), -i, -k, and -S ethtool output is available.

Firmware versions from the bnx2 ethtool output:
eth0 firmware-version: 4.6.4 NCSI 1.0.3
eth1 firmware-version: 4.6.4 NCSI 1.0.3
eth2 firmware-version: 4.6.4
eth3 firmware-version: 4.6.4

The bnx2i and cnic modules are not loaded.
Uploading data. The customer did see packet drops with the test kernel and the ring buffer at 1020. I asked them to increase the ring buffer further and they have not seen any drops yet. They will keep it that way until Monday. Please take a peek at the data and let me know if there is anything else we would like to try. One last thing: they reported that there were issues with bonding failover with the new kernel. I have not yet attempted to reproduce this.
Created attachment 452436 [details] network script output showing drops with 1020 ring buffer
To add some additional perspective, I started thinking about how quickly we can actually get these frames off the wire. At 1 Gbps, the minimum time to receive all the 64-byte frames that can be cleared in one full NAPI poll (weight=64) would be ~37us. (That grows to ~600us for a queue of 1020 and ~1.2ms for a queue of 2040.)

I did some more thinking about the coalesce times used by bnx2-based devices, and I think the default of 18us or 12 frames (which would be ~7us at gigabit speeds with 64-byte frames) is reasonable for most cases. I can tell you that we have customers who run more aggressive coalesce settings (even 1-2us) because latency is the most important factor for them. (NOTE: those users still receive more than one frame with each NAPI poll event, since typically frames have continued to arrive since the interrupt popped.)

I do think it would be interesting to adjust the coalesce settings and see if your workload sees any improvement. I think there would be some when these bursts of traffic happen.

Out of curiosity, can you also let me know if multicast traffic is the primary traffic coming into the box for this application? If so, how many multicast groups are used? How many different user-space processes are open and listening on sockets receiving that traffic? If the traffic is not multicast, is it primarily unicast TCP or UDP?
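(For reference, the timing figures above appear to follow from a back-of-the-envelope calculation assuming a 64-byte frame plus an 8-byte preamble per packet, ignoring the inter-frame gap:

  64 frames x 72 bytes x 8 bits / 1 Gbps ≈  37 us   (one full NAPI poll, weight=64)
1020 frames x 72 bytes x 8 bits / 1 Gbps ≈ 590 us   (RX ring of 1020)
2040 frames x 72 bytes x 8 bits / 1 Gbps ≈ 1.2 ms   (RX ring of 2040)

In other words, with minimum-size frames arriving at line rate, a NAPI stall of only a few hundred microseconds is enough to overflow even an enlarged ring.)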
(In reply to comment #13)
> Created attachment 452436 [details]
> network script output showing drops with 1020 ring buffer

The rx_fw_discards counter is constant at 610 throughout the log. The earlier attachment shows periodic discards. Do we know for a fact that a different vendor's NIC operating under the same conditions (same traffic, same bonding, same ring size, no flow control) is not experiencing drops?
Customer is comfortable that the increased ring buffer cap has resolved their issue and would like some idea of when they may see it in a released kernel. Are we comfortable that this is a valid resolution? How soon could we get them a hotfix if so?
If other vendors' NICs under the same conditions require a similar buffer size to prevent drops, then I'm comfortable with the solution. Getting DaveM to accept the 2K buffer max may not be easy, though.
Before we go the increased-ring-size route with this one, to Andy's point in comment #15, is there any way we can get the customer to change their coalesce values? I have found that reducing either rx-frames or rx-usecs reduces the rate at which rx_fw_discards increments (when the rx ring value is artificially low). I captured what I did, see below.

First, the initial setup, just after a reboot:

[root@hs22-01 ~]# ethtool -i eth0
driver: bnx2
version: 1.9.3-1
firmware-version: 4.6.4 NCSI 1.0.6
bus-info: 0000:10:00.0

[root@hs22-01 ~]# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 2040
RX Mini: 0
RX Jumbo: 8160
TX: 255
Current hardware settings:
RX: 255
RX Mini: 0
RX Jumbo: 0
TX: 255

[root@hs22-01 ~]# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 18
rx-frames: 6
rx-usecs-irq: 18
rx-frames-irq: 6
tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 80
tx-frames-irq: 20
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

Second, reduce the ring buffers to induce discards:

[root@hs22-01 ~]# ethtool -G eth0 rx 4
[root@hs22-01 ~]# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 2040
RX Mini: 0
RX Jumbo: 8160
TX: 255
Current hardware settings:
RX: 4
RX Mini: 0
RX Jumbo: 0
TX: 255

Flood eth0 from another system and check discards:

[root@hs22-01 ~]# ethtool -S eth0 | grep rx_fw_discards
rx_fw_discards: 8

Lower rx-usecs from 18 to 8:

[root@hs22-01 ~]# ethtool -C eth0 rx-usecs 8
[root@hs22-01 ~]# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 8
rx-frames: 6
rx-usecs-irq: 18
rx-frames-irq: 6
tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 80
tx-frames-irq: 20
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

Note: changing the coalesce value resets rx_fw_discards.

[root@hs22-01 ~]# ethtool -S eth0 | grep rx_fw_discards
rx_fw_discards: 0

Flood eth0 from another system and check discards:

[root@hs22-01 ~]# ethtool -S eth0 | grep rx_fw_discards
rx_fw_discards: 1

Flood again with the same load just to make sure:

[root@hs22-01 ~]# ethtool -S eth0 | grep rx_fw_discards
rx_fw_discards: 2

Reduce rx-frames from 6 to 2 and put rx-usecs back to 18:

[root@hs22-01 ~]# ethtool -C eth0 rx-frames 2
[root@hs22-01 ~]# ethtool -C eth0 rx-usecs 18

Flood eth0 from another system and check discards:

[root@hs22-01 ~]# ethtool -S eth0 | grep rx_fw_discards
rx_fw_discards: 1

And again:

[root@hs22-01 ~]# ethtool -S eth0 | grep rx_fw_discards
rx_fw_discards: 2

So hopefully this illustrates that the coalesce values do affect the rate of discard, and that we should pursue all avenues currently available before making changes to code.
rx-frames and rx-frames-irq need to be a small fraction of the rx ring size. To coalesce, we need host buffers to store the number of packets we try to coalesce plus others in flight.

Now that I think about this some more, our hardware may require more host buffers than other NICs to achieve no drops with the same traffic pattern. When the last host buffer is used up, the next packet that is ready for DMA to the host ring will be dropped immediately by firmware (rx_fw_discards). On a different vendor's NIC under the same conditions, packets that arrive after the host ring is full may continue to be buffered in on-chip buffers.

Our design is like this to avoid head-of-line blocking. If one RSS ring is full, the packets behind the next packet may be for other RSS rings or for iSCSI, which may have plenty of buffers, so we don't utilize the on-chip buffers when one ring is full. If flow control is enabled, the firmware will not immediately drop packets when one ring is full; flow control by nature causes head-of-line blocking, so the firmware will queue the packets and allow pause frames to be generated.

When packet sizes are small, the situation is worse, as the on-chip buffers can hold a large number of these small packets when the host CPU is temporarily behind. So at least in theory, everything seems to make sense. The only way to really confirm this is to profile the system to see if successive NAPI intervals are really long enough to cause ring overflow at ring size 1020. As I said before, I tried to do that with a debug patch but it wasn't very successful.
So do we no longer think that they should confirm the issue by looking for drops on their e1000 NICs? I have a call with them in just under an hour, and they are looking for an action plan, as it seems to them that we have the issue resolved.
I think comparing with e1000 will be useful. We can find out from published specs how much on-chip buffering they have and account for any differences, if any. But we can just as well do these experiments ourselves at Broadcom if the customer doesn't want to proceed further with more tests. I'd like to hear Andy's and John's opinions.
Before we move on to increasing the ring buffers, there was a question posed in comment #15 about the type of traffic, specifically whether there was multicast.

>Out of curiosity, can you also let me know if multicast traffic is the primary
>traffic coming into the box for this application? If so, how many multicast
>groups are used? How many different user-space processes are open and
>listening on sockets receiving that traffic?
>If the traffic is not multicast, is it primarily unicast TCP or UDP?

I looked at the ethtool data provided in one of the first comments (ethtool_output.tar.gz) and noticed that only eth3 had rx_fw_discards greater than 0, so I used eth3's info.

The number of discards was:
ethtool_-S_eth3: rx_fw_discards: 10699

From what I see, eth3 had the following breakdown:
rx_ucast_packets: 871294266
rx_mcast_packets: 14348
rx_bcast_packets: 6504710

Can I assume this correctly represents the problem?

With regard to the test kernel rpms, I assume there have been no problems reported. It did have a patch to fix a NAPI problem in addition to the increase in the rx ring buffer max, so I just want to officially ask for confirmation.
Correct, no issues reported since they put it in last Sunday.
(In reply to comment #25)
> Can I assume this correctly represents the problem?
> With regard to the test kernel rpms, I assume there have been no problems
> reported.

Thanks, John. There is definitely a case where a large number of multicast listeners could cause netif_receive_skb to slow down enough that we would have issues with line-rate traffic. There is still a chance that the sink for this traffic causes netif_receive_skb to run for longer than we expect, but I don't know if we will be able to gather that information in this case.
Coalesce settings for bnx2

default:

# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 18
rx-frames: 12
rx-usecs-irq: 18
rx-frames-irq: 2
tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 18
tx-frames-irq: 2

# ethtool -C eth2 rx-usecs 8 rx-usecs-irq 8 rx-frames 0 rx-frames-irq 0

suggested:

# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 8
rx-frames: 0
rx-usecs-irq: 8
rx-frames-irq: 0
tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 18
tx-frames-irq: 2
Andy,

You mentioned on our call that there were specific data points you wanted oprofile to capture. Could you document those here so that I can get an action plan together?

Thanks!
-Guil
Dave Miller decided to take Michael's maximum RX ring buffer increase upstream.
I built some test kernels that have the NAPI fixes as well as the increased ring-buffer size. They can be found here: http://people.redhat.com/agospoda/#rhel5
Thanks Andy, can we push that increased ring buffer change to the GA kernels?
(In reply to comment #32)
> Thanks Andy, can we push that increased ring buffer change to the GA kernels?

Yes. I do not want to take that patch without taking the NAPI changes. Both patches are available here:

http://people.redhat.com/agospoda/rhel5/0248-bnx2-fixup-broken-NAPI-accounting.patch
http://people.redhat.com/agospoda/rhel5/0249-bnx2-Increase-max-rx-ring-size-from-1K-to-2K.patch
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-230.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html