| Summary: | Slow traffic between instances 1Gbps | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Robin Cernin <rcernin> |
| Component: | openstack-neutron | Assignee: | Ihar Hrachyshka <ihrachys> |
| Status: | CLOSED NOTABUG | QA Contact: | Toni Freger <tfreger> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 (Kilo) | CC: | akaris, amuller, ccharron, chrisw, ihrachys, nyechiel, pablo.iranzo, rcernin, skinjo, srevivo |
| Target Milestone: | async | | |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-22 14:37:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Robin Cernin
2016-10-17 16:56:50 UTC
Ihar, can you please look at the SOS report and see if there are any MTU-related configuration issues that would explain low throughput between VMs on different compute nodes?

Hi,

This issue has come up on many occasions with many customers.

Yes, increasing the MTU is a workaround that works by lowering the number of PPS across the VXLAN tunnels. However, increasing the MTU only works if we stay within the cluster. Once we are out in the wild, we don't control the MTU any more. The real problem here is that sending packets through the VXLAN tunnels of OVS creates high software interrupts, and only on one CPU. This is because hardware offloading technologies such as GRO won't work for traffic within the tunnel. Sending outside of the tunnel is fine, because the hardware can help the kernel process a higher number of packets.

I find it very difficult to explain to customers that increasing the MTU should be the solution when in reality it's only a way to lower the PPS, and thus only a workaround for an underlying problem. I think the recommendation to customers should be to buy certified / tested NICs with VXLAN offloading.

(In reply to Andreas Karis from comment #9)
> Hi,
>
> This issue has come up on many occasions with many customers.
>
> Yes, increasing the MTU is a workaround that works by lowering the number of
> PPS across the VXLAN tunnels. However, increasing the MTU only works if we
> stay within the cluster. Once we are out in the wild, we don't control the
> MTU any more. The real problem here is that sending packets through the
> VXLAN tunnels of OVS creates high software interrupts and this only on one
> CPU. This is due to the fact that hardware offloading technologies such as
> GRO won't work for traffic within the tunnel Sending outside of the tunnel
> is fine, because the hardware can help the kernel to process a higher number
> of packets.
>
> I find it very difficult to explain to customers that the increase of MTU
> should be the solution when in reality it's only a way to lower the PPS, and
> thus is only a workaround, for an underlying problem. I think the
> recommendation to customers should be to buy certified / tested NICs with
> VXLAN offloading.

I don't consider jumbo frames and VXLAN offloading a workaround. They're required to obtain line-rate speeds and are the recommendation of our own performance team.

I didn't say that VXLAN offloading was a workaround. But using jumbo frames effectively *is* a workaround. The problem is that once the packets go through the VXLAN tunnel, they are switched in software. We cannot switch as many packets as we'd like in software, which is why throughput drops significantly. So we enable jumbo frames to lower the total number of PPS, which in my opinion is a workaround: we can't achieve our goal (switching a high number of packets), so we lower the number of packets instead.

Regardless of that, I agree with you that both jumbo frames and VXLAN offloading should be recommended to customers. I'd just like to see some more recommendations / documentation for VXLAN offloading. Some customers are reluctant to implement the jumbo frame step, and I'd like to give them an alternative. Also note that I don't think we have a "recommendation of our own performance team" with performance measurements and recommended NIC hardware that we can just forward to our customers. Or do we?
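As a point of reference, one quick way to check whether the underlay NIC advertises VXLAN (UDP tunnel) segmentation offload and GRO, and to observe the single-CPU softirq load described above, is something like the following sketch (`eth0` is a placeholder for the actual tunnel interface on the compute node):

```
# Sketch: does the NIC driver expose UDP tunnel segmentation offload and GRO?
# "[fixed]" next to an "off" value means the driver/hardware cannot enable it.
ethtool -k eth0 | grep -E 'tx-udp_tnl-segmentation|generic-receive-offload'

# Sketch: watch per-CPU software interrupt load (%soft) while the throughput
# test between the two instances is running; without offload the load tends
# to concentrate on a single core.
mpstat -P ALL 1
```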
Let's take a different example: what if the customer's goal was not total throughput, but PPS with packet sizes < 1500 bytes? Increasing the MTU doesn't help at all in that case.

I don't think we have any MTU-related fixes or options that would help for Liberty+ setups. In Kilo, we only have network_device_mtu (on both the neutron and nova sides), and it does not distinguish between network types.

Sadly, I don't have access to the sos reports; the directory does not exist. Could you please re-upload them? Depending on the tenant network types used, we may try to make the network_device_mtu option work. How was the cluster deployed? Was OSP Director used?

If the documentation for VXLAN offloading is not present or in bad shape, please report a documentation bug.

I think it's clear that this and the related customer cases are issues with hardware not being picked correctly, and/or jumbo frames not being configured. There is not much engineers can do. If you still experience performance issues that you think may be related to how Neutron configures bridges and tap devices, please reopen with an explanation.
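For illustration only, the Kilo-era network_device_mtu option mentioned above could be set along these lines (a sketch, not a validated procedure: the MTU value of 9000 is an example and assumes the physical network supports jumbo frames end to end, and the openstack-config utility from openstack-utils is assumed to be installed):

```
# Sketch: set a global MTU for Neutron- and Nova-created devices on Kilo.
# Note that network_device_mtu is global and does not distinguish between
# network types (VXLAN overhead still has to fit inside the physical MTU).
openstack-config --set /etc/neutron/neutron.conf DEFAULT network_device_mtu 9000
openstack-config --set /etc/nova/nova.conf DEFAULT network_device_mtu 9000

# Restart the affected services on the relevant nodes afterwards, e.g.:
# systemctl restart neutron-openvswitch-agent openstack-nova-compute
```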