Description of problem:
strange i/o network+disk throughput on windows VMs unless you put this in modprobe.conf on dom0:

options bnx2x disable_tpa=1

Version-Release number of selected component (if applicable):
bnx2x module version 1.45.23q
kernel-xen version 2.6.18-128.4.1.el5xen

How reproducible:
we have 10 blades and this error is like clockwork.

Steps to Reproduce:
1. have an hp bl460c g6 with bnx2x passthrough ethernet
2. boot a xen kernel or try the kvm 5.4 beta
3. create a domU windows server guest - 2008 or 2003, 32b or 64b
4. from the windows server guest, download something

Actual results:
nearly no network throughput on guest. nothing good happens, throughput of 1-2kb/s until network eventually stalls. traffic that doesn't incur disk i/o happens normally -- you can RDP to the virtual guest and it behaves normally. the system will never stop responding to pings. If you download something to that guest, and incur net+disk i/o, the network stalls.

Expected results:
normal virtualized network throughput

Additional info:
'options bnx2x disable_tpa=1' in modprobe.conf fixes this issue.
Thanks for the report. When you say:

> 1. have an hp bl460c g6 with bnx2x passthrough ethernet

What does that mean, exactly? Are you doing PCI passthrough of the device to the guest? Or are you just bridging the device, and connecting the guest into the bridge?

Chris Lalancette
it's a description of the hardware on the hp 7000 blade system that hosts the bl460c blade server. we're bridging the device and connecting the guest into the bridge.
I'm seeing this problem as well.

The host has a bnx2x NIC, and it's running RHEL 5.4 with Xen, currently with the -173 dzickus el5xen test kernel.

Xen PV guests using the virbr0 nat/dhcp bridge have _really_ slow networking, like 1 - 5 kB/sec. It's impossible to install new guests using virt-install because of this. Dom0 networking is OK/fast.

Host/dom0 dmesg has huge amounts of:

<name_of_the_outbound_bridge>: received packets cannot be forwarded while LRO is enabled

Adding "options bnx2x disable_tpa=1" to /etc/modprobe.conf and rebooting fixes the problem for guest vms.
this also occurs with KVM and linux guests.
Comment #3 claims that Dom0 networking is okay. This is wrong.

iperf output for the machine running 2.6.18-164.6.1.el5:

[root@minos ~]# iperf -c buildvirt-01 -p 80
------------------------------------------------------------
Client connecting to buildvirt-01, TCP port 80
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.150.82.81 port 58780 connected with 10.147.103.16 port 80
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  87.2 MBytes  72.8 Mbits/sec

iperf output for the machine running 2.6.18-164.6.1.el5.xen:

[root@minos ~]# iperf -c buildvirt-01 -p 80
------------------------------------------------------------
Client connecting to buildvirt-01, TCP port 80
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.150.82.81 port 57288 connected with 10.147.103.16 port 80
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.6 sec  176 KBytes  136 Kbits/sec
When disabling tpa:

[root@minos ~]# iperf -c buildvirt-01 -p 80
------------------------------------------------------------
Client connecting to buildvirt-01, TCP port 80
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.150.82.81 port 38737 connected with 10.147.103.16 port 80
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  66.2 MBytes  55.3 Mbits/sec
[root@minos ~]#
The difference between the 128- and 164-based kernels is pretty interesting. The 128 kernels have a bug where they do not properly enforce the limit that would normally be imposed by the napi weight. That should explain some of the increased throughput, but I would be shocked if the interrupt load was so great that the performance was ~3X better just because we were consuming 300 frames instead of 64 with each napi poll.
To confirm or rule out that the performance degradation is caused by the napi limit, we can simply set the weight back to 300, as it effectively was before, i.e.:

echo 300 > /sys/class/net/ethX/weight
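As a rough back-of-the-envelope illustration of what that weight change means, the sketch below counts how many poll calls it takes to drain a burst of queued frames with a per-poll budget of 64 versus 300. The helper names and the burst size used in the note below are made up for illustration; this is plain arithmetic, not driver code.

```c
#include <assert.h>
#include <stdio.h>

/* Number of NAPI-style poll calls needed to drain `frames` queued frames
 * when each poll may consume at most `weight` frames (ceiling division).
 * Hypothetical helper, for illustration only. */
unsigned int polls_needed(unsigned int frames, unsigned int weight)
{
    assert(weight > 0);
    return (frames + weight - 1) / weight;
}

/* Print the two budgets discussed in this bug side by side. */
void print_comparison(unsigned int frames)
{
    printf("%u frames: weight 64 -> %u polls, weight 300 -> %u polls\n",
           frames, polls_needed(frames, 64), polls_needed(frames, 300));
}
```

Calling `print_comparison(3000)` from a `main()` would report 47 polls at weight 64 and 10 at weight 300. Whether that difference in poll count translates into a measurable throughput difference is exactly what the `echo 300` experiment is meant to show.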
TPA unfortunately is a form of LRO. Unlike GRO, it is fundamentally incompatible with bridging. So it has to be disabled when bridging is in use.
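To make the incompatibility a little more concrete: a forwarding device has to send frames out again within the egress size limit, but an LRO engine hands the stack one merged buffer without the information needed to split it back into the original segments, whereas GRO remembers the segment size so the packet can be cut up again on the way out. The sketch below is a simplified userspace model of that distinction. The struct, its fields, and the function names are invented for illustration and are not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a received packet (not the kernel's sk_buff). */
struct rx_packet {
    size_t len;          /* total payload length after any aggregation */
    size_t seg_size;     /* size of the original segments, 0 if unknown */
    bool resegmentable;  /* true if enough metadata was kept to split the
                            packet again (the GRO case in this model) */
};

/* Merge `count` segments of `seg_size` bytes the way an LRO engine is
 * modelled here: one big buffer, original boundaries discarded. */
struct rx_packet aggregate_lro(size_t seg_size, size_t count)
{
    struct rx_packet p = { seg_size * count, 0, false };
    return p;
}

/* The same merge modelled GRO-style: the segment size is remembered. */
struct rx_packet aggregate_gro(size_t seg_size, size_t count)
{
    struct rx_packet p = { seg_size * count, seg_size, true };
    return p;
}

/* A forwarding device can send the packet if it already fits the egress
 * limit, or if it can be cut back into segments that fit. */
bool can_forward(const struct rx_packet *p, size_t egress_limit)
{
    assert(p != NULL);
    if (p->len <= egress_limit)
        return true;
    return p->resegmentable && p->seg_size > 0 &&
           p->seg_size <= egress_limit;
}
```

In this model, ten 1448-byte segments merged by `aggregate_lro` cannot be forwarded through a 1500-byte limit, while the same segments merged by `aggregate_gro` can, which mirrors the "received packets cannot be forwarded while LRO is enabled" message reported earlier in this bug.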
This bug needs clarification on whether there is also a performance regression beyond what is corrected by disabling TPA. Or in other words, does disabling TPA resolve all issues for this bug?

If not, we should open a new bug with the regression details.

Thanks,
Andrew
(In reply to comment #17)
> This bug needs clarification on whether there is also a performance regression
> beyond what is corrected by disabling TPA. Or in other words, does disabling
> TPA resolve all issues for this bug?

The customer is seeing a regression; some more details are in Comment 7. In my testing environment I did not see a regression; however, I only changed the kernel and ran the tests with the RHEL5.4 userland. Perhaps with a full RHEL5.3 installation we can see the regression as well. I'm going to check that.

> If not, we should open a new bug with the regression details.

Agreed.
Just curious -- any good reason that TPA needs to be enabled by default?
For performance reasons :) on non-virtual machines.

Probably the option disable_tpa=1 should be added to modprobe.d by default by the installation program when bridges are used. If not, we could eventually do something like this in the bnx2x driver:

#ifdef XEN
static int disable_tpa = 1;
#else
static int disable_tpa = 0;
#endif

But this does not look like a good solution; for example, what about KVM and RHEL6?
(In reply to comment #22)
> #ifdef XEN
> static int disable_tpa = 1;
> #else
> static int disable_tpa = 0;
> #endif
>
> But this does not look like a good solution; for example, what about KVM and
> RHEL6?

OK, there is a better way. We can backport commit

commit 0187bdfb05674147774ca79a79942537f3ad54bd
Author: Ben Hutchings <bhutchings>
Date: Thu Jun 19 16:15:47 2008 -0700

    net: Disable LRO on devices that are forwarding

That will be the best solution since this is the way it is done upstream.
(In reply to comment #25)
> (In reply to comment #22)
> > #ifdef XEN
> > static int disable_tpa = 1;
> > #else
> > static int disable_tpa = 0;
> > #endif
> >
> > But this does not look like a good solution; for example, what about KVM and
> > RHEL6?
>
> OK, there is a better way. We can backport commit
>
> commit 0187bdfb05674147774ca79a79942537f3ad54bd
> Author: Ben Hutchings <bhutchings>
> Date: Thu Jun 19 16:15:47 2008 -0700
>
>     net: Disable LRO on devices that are forwarding
>
> That will be the best solution since this is the way it is done upstream.

That exact patch is a kABI breaker (the ethtool set_flags bits). This is why we have not included it. Dave Miller had an interesting suggestion to create a way to allow a device to register if it is capable of LRO and when needed we could call down and disable it.

This was taken from an email from Dave:

Therefore we can implement the handling using whatever datastructures and interfaces we want.

For example, we could have:

typedef int (*lro_func_t)(struct net_device *, bool enable);

int register_lro_netdev(struct net_device *dev, lro_func_t func);
void unregister_lro_netdev(struct net_device *dev);

and then a driver goes:

        int ret = register_netdevice(dev);

        if (ret)
                err_register;
        ret = register_lro_netdev(dev, mydev_lro_func);

We maintain a simple linked list of LRO netdevs, and when bridging or routing wants to turn it off it calls some interface we provide like:

struct netdev_lro_entry {
        struct list_head list;
        struct net_device *dev;
        lro_disable_func_t func;
};

static struct list_head lro_netdevs;

int netdev_lro_disable(struct net_device *dev)
{
        struct netdev_lro_entry *p;
        int err = -ENODEV;

        list_for_each_entry(p, &lro_netdevs, list) {
                if (p->dev == dev) {
                        err = p->func(dev, false);
                        break;
                }
        }
}
EXPORT_SYMBOL(netdev_lro_disable);

and there's an equivalent netdev_lro_enable().
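The registry idea from Dave's email can be tried out in userspace. The sketch below is only a model under stated assumptions: `struct net_device` is a two-field stand-in, a hand-rolled singly linked list replaces the kernel's `list_head`, there is no locking, and `netdev_lro_set` is a helper invented here. Compared with the email's outline it uses one typedef name (`lro_func_t`) throughout and returns the error code from the lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Minimal stand-in for the kernel's struct net_device. */
struct net_device {
    const char *name;
    bool lro_on;
};

typedef int (*lro_func_t)(struct net_device *dev, bool enable);

struct netdev_lro_entry {
    struct netdev_lro_entry *next;
    struct net_device *dev;
    lro_func_t func;
};

/* Head of the list of LRO-capable devices (no locking in this model). */
static struct netdev_lro_entry *lro_netdevs;

int register_lro_netdev(struct net_device *dev, lro_func_t func)
{
    struct netdev_lro_entry *e;

    assert(dev != NULL && func != NULL);
    e = malloc(sizeof(*e));
    if (!e)
        return -ENOMEM;
    e->dev = dev;
    e->func = func;
    e->next = lro_netdevs;
    lro_netdevs = e;
    return 0;
}

void unregister_lro_netdev(struct net_device *dev)
{
    struct netdev_lro_entry **pp = &lro_netdevs;

    while (*pp) {
        if ((*pp)->dev == dev) {
            struct netdev_lro_entry *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}

/* Look the device up and call its callback; -ENODEV if not registered. */
static int netdev_lro_set(struct net_device *dev, bool enable)
{
    struct netdev_lro_entry *p;

    for (p = lro_netdevs; p; p = p->next)
        if (p->dev == dev)
            return p->func(dev, enable);
    return -ENODEV;
}

int netdev_lro_disable(struct net_device *dev)
{
    return netdev_lro_set(dev, false);
}

int netdev_lro_enable(struct net_device *dev)
{
    return netdev_lro_set(dev, true);
}

/* Example driver callback: just records the requested state. */
int mydev_lro_func(struct net_device *dev, bool enable)
{
    dev->lro_on = enable;
    return 0;
}
```

In this model the bridge or routing code would call `netdev_lro_disable(dev)` when a device is enslaved or forwarding is turned on. A device that never registered simply yields `-ENODEV`, which the caller can treat as "nothing to disable".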
(In reply to comment #26)
> > commit 0187bdfb05674147774ca79a79942537f3ad54bd
> > Author: Ben Hutchings <bhutchings>
> > Date: Thu Jun 19 16:15:47 2008 -0700
> >
> >     net: Disable LRO on devices that are forwarding
> >
> > That will be the best solution since this is the way it is done upstream.
>
> That exact patch is a kABI breaker (the ethtool set_flags bits). This is why
> we have not included it. Dave Miller had an interesting suggestion to create a
> way to allow a device to register if it is capable of LRO and when needed we
> could call down and disable it.

Why not just create an additional ethtool_ops_ext structure?

struct ethtool_ops_ext {
        struct ethtool_ops *ops;
        struct ethtool_aux *aux;
};

Plus some bit in netdev->flags that indicates whether the driver uses ethtool_ops_ext or ethtool_ops. This will allow adding other ethtool methods in the future.
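A userspace sketch of how such a wrapper could be dispatched, assuming the flag-bit approach described in this comment. Everything here is a simplified stand-in: the struct layouts, the `DEV_HAS_ETHTOOL_EXT` flag, the `DEMO_FLAG_LRO` bit, and the helper names are hypothetical and do not correspond to actual RHEL kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEV_HAS_ETHTOOL_EXT 0x1u  /* hypothetical netdev flag bit */
#define DEMO_FLAG_LRO       0x2u  /* hypothetical "LRO enabled" feature bit */

struct net_device;

/* The frozen, kABI-protected operations table (one member for brevity). */
struct ethtool_ops {
    int (*get_link)(struct net_device *dev);
};

/* New operations live here so the frozen table never changes size. */
struct ethtool_aux {
    int (*set_flags)(struct net_device *dev, unsigned int data);
};

struct ethtool_ops_ext {
    struct ethtool_ops *ops;
    struct ethtool_aux *aux;
};

struct net_device {
    unsigned int flags;
    void *ethtool_ops;      /* ethtool_ops or ethtool_ops_ext, per flags */
    unsigned int features;
};

/* Return the classic ops table regardless of which form the driver uses. */
struct ethtool_ops *dev_ethtool_ops(struct net_device *dev)
{
    if (dev->flags & DEV_HAS_ETHTOOL_EXT)
        return ((struct ethtool_ops_ext *)dev->ethtool_ops)->ops;
    return dev->ethtool_ops;
}

/* Call the new set_flags hook if the driver provides one. */
int dev_ethtool_set_flags(struct net_device *dev, unsigned int data)
{
    struct ethtool_ops_ext *ext;

    if (!(dev->flags & DEV_HAS_ETHTOOL_EXT))
        return -EOPNOTSUPP;
    ext = dev->ethtool_ops;
    if (!ext->aux || !ext->aux->set_flags)
        return -EOPNOTSUPP;
    return ext->aux->set_flags(dev, data);
}

/* Demo driver callbacks. */
int demo_get_link(struct net_device *dev)
{
    (void)dev;
    return 1;
}

int demo_set_flags(struct net_device *dev, unsigned int data)
{
    dev->features = data;
    return 0;
}
```

The cost of this design shows up in `dev_ethtool_ops`: every existing caller that dereferences the ops pointer has to go through a helper that checks the flag first, which is part of why it is debated below as a workaround.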
(In reply to comment #27)
>
> Why not just create an additional ethtool_ops_ext structure?
>
> struct ethtool_ops_ext {
>         struct ethtool_ops *ops;
>         struct ethtool_aux *aux;
> };
>
> Plus some bit in netdev->flags that indicates whether the driver uses
> ethtool_ops_ext or ethtool_ops. This will allow adding other ethtool methods
> in the future.

I see that as a hack that should only be used when no other option exists.
(In reply to comment #28)
> I see that as a hack that should only be used when no other option exists.

It's not pretty, but:
1) It's a standard method in the RHEL kernel when we want to add fields to a structure without breaking kABI, for example signal_with_aux_struct.
2) It's extensible: with it in place we can easily add new ethtool methods.
3) It allows us to keep the code close to upstream.

For me this is a better way than the one Dave proposed.
(In reply to comment #29)
> (In reply to comment #28)
> > I see that as a hack that should only be used when no other option exists.
>
> It's not pretty, but:
> 1) It's a standard method in the RHEL kernel when we want to add fields to a
> structure without breaking kABI, for example signal_with_aux_struct.
> 2) It's extensible: with it in place we can easily add new ethtool methods.
> 3) It allows us to keep the code close to upstream.
>
> For me this is a better way than the one Dave proposed.

I'm well aware of how this can be used to hack around kABI limitations. My personal opinion is that I would rather see a small deviation from upstream than a kABI workaround like you have proposed. I do not like to see those used unless no other reasonable option exists.
Broadcom posted fixes two weeks ago to remove LRO and use GRO for bnx2x and David Miller has added them to net-next. I suggest we take that patch rather than focus energy finding creative ways to disable LRO on this driver.

commit 4fd89b7af28292e190650b9b9bc4308658d81dd1
Author: Dmitry Kravkov <dmitry>
Date: Thu Apr 1 19:45:34 2010 -0700

    bnx2x: Added GRO support
Actually, that patch just adds GRO on top of LRO and if LRO is still active, LRO is still the one that will be used. This is because our LRO is HW/FW based (TPA) and it is much better (about double) than the SW GRO solution. Please see more information in Bug 573114
Thanks, Eilon. Though your hardware LRO is still used (as is the case with other network drivers that now support GRO), this does get around the problems LRO has when asked to be in a forwarding device, right?
*** This bug has been marked as a duplicate of bug 582367 ***
Bug 582367 really contains the answer to this question. With the enhancement from that bug, the user does not need to manually disable LRO. Without it, the user should disable LRO so that only GRO is used.
Here are kernel packages with GRO and auto disable LRO for bnx2x, if someone is interested in testing: http://people.redhat.com/sgruszka/rhel5/bz573114/
Stanislaw,

I tested your xen kernel on RHEL5.5 x86_64. The hardware: HP BL460c + bnx2x (Virtual connect - Broadcom Corporation NetXtreme II BCM57711E 10-Gigabit PCIe).

[root@hsl0000 ~]# uname -a
Linux hsl0000.domain.local 2.6.18-197.el5.bnx2x_testxen #1 SMP Wed Apr 28 08:58:56 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Everything works like a charm now.