Description of problem:
After spawning 100 guests on the same tenant network and attempting to ping all the guests from a single VM, we begin seeing "no route to host" errors. To work around this, we can go into the VM that we could not ping and ping the dhcp-namespace; then pings return.

Version-Release number of selected component (if applicable):
python-openvswitch-2.3.2-1.git20150730.el7_1.noarch
openstack-neutron-openvswitch-2015.1.1-3.el7ost.noarch
openvswitch-2.3.2-1.git20150730.el7_1.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an OSP environment with VXLAN
2. Launch 100 guests within a tenant network
3. Attempt to ping all the guests from a single guest

Actual results:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/100-run-update-to-ovs-run7.out

Expected results:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/100-run-update-to-ovs-run6.out

Additional info:
After many attempts, we would have a single attempt succeed. There seems to be an issue with ARP, but it is difficult to pinpoint.
Correction to the above: s/ping/ssh. Ping would fail too.
I guess it's ARP table size related? I'd look into that and raise the sysctl limits. Bear in mind that we traverse iptables, so it's not only the OVS ARP table size.
This might be related to an overflow of the CPU's backlog queue. Every time OVS needs to broadcast (especially for MAC learning), it will send copies to all available ports. If there are more broadcasts going on, the CPU backlog queue can overflow (the default is 1000), and that leads to incomplete ARP resolutions for random IP addresses. So, if you try a few times consecutively, you will see the problem move around or even not happen. Please try bumping sysctl netdev_max_backlog to 10000 and see if that helps out.

Thanks,
fbl
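For reference, a minimal sketch of applying that suggestion (the value 10000 is the one suggested above; `sysctl -w` needs root, and per the follow-up comment in this thread OVS caches the value at startup, so it must be restarted afterwards):

```shell
# Raise the per-CPU input backlog queue from the default of 1000.
# Requires root; restart openvswitch afterwards so it picks up the
# new value instead of the one cached at initialization.
sysctl -w net.core.netdev_max_backlog=10000 2>/dev/null \
    || echo "need root (or sysctl) to change this value"

# The running value can be read back without root:
cat /proc/sys/net/core/netdev_max_backlog
```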
I bumped netdev_max_backlog across all machines. This didn't seem to have any impact.

Here is an interesting observation:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/BAGL/ping-with-retry2.out

What is interesting to note is that I ping the router IP, which I would think is totally useless... However, if I did not have that line, the results look like this*:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/BAGL/ping-with-retry.out

*Note I did add more output; this script only printed failures at first.

This script simply uses the namespace to reach each guest:

for i in `nova list | grep ACTIVE | awk '{print $12}' | cut -d "=" -f2`; do
  count=$(ip netns exec qrouter-2503ad84-c7f5-4251-b0c9-6c5ecd7c5dba ping -c 3 $i | grep 'received' | awk -F',' '{print $2}' | awk '{print $1}')
  if [ $count -eq 0 ]; then
    echo "------------------------------ Host $i - Failed -- Retrying --------------------------------------------"
    ping_somewhere_else=$(ip netns exec qrouter-2503ad84-c7f5-4251-b0c9-6c5ecd7c5dba ping -c 3 10.0.0.1)
    count=$(ip netns exec qrouter-2503ad84-c7f5-4251-b0c9-6c5ecd7c5dba ping -c 3 $i | grep 'received' | awk -F',' '{print $2}' | awk '{print $1}')
    if [ $count -eq 0 ]; then
      echo "-------------------- Host $i - Failed 2nd attempt --------------------------"
    else
      echo "------------ Host $i - 2nd attempt worked..--------------------"
    fi
  else
    echo "Host $i worked..."
  fi
done
Also, tunnel type does not matter; we are running with GRE and it still occurs.
You need to bump the sysctl before starting OVS, otherwise it will keep using the cached value read at initialization. Have you restarted OVS, or started OVS after bumping the sysctl value? Also look at the 'ovs-dpctl show' output and see if 'lost' is increasing.

Thanks,
fbl
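As a sketch, the 'lost' counter can be pulled out of that output with awk. The snippet below parses a captured sample line from later in this thread; in practice you would pipe the live `ovs-dpctl show` through the same filter and run it periodically to see whether 'lost' is increasing:

```shell
# Extract the 'lost' count from the datapath stats line, e.g.
#   lookups: hit:349839642 missed:17559558 lost:304
# Live usage would be:  ovs-dpctl show | awk -F'lost:' '/lost:/ {print $2}'
sample='lookups: hit:349839642 missed:17559558 lost:304'
echo "$sample" | awk -F'lost:' '/lost:/ {print $2}'
# -> 304
```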
fbl - This did not help. At first we thought it might have, and I increased it to 100000, yet we are still seeing issues.

Output from the test (note 'lost', and we had failures):

# ovs-dpctl show
system@ovs-system:
        lookups: hit:349839642 missed:17559558 lost:304
        flows: 289
        masks: hit:1654487698 total:8 hit/pkt:4.50
        port 0: ovs-system (internal)
        port 1: br-vlan (internal)
        port 2: br-ex (internal)
        port 3: vlan10 (internal)
        port 4: vlan302 (internal)
        port 5: p2p2
        port 6: p2p1
        port 7: gre_system (gre: df_default=false, ttl=0)
        port 8: vlan301 (internal)
        port 9: vlan303 (internal)
        port 10: vlan304 (internal)
        port 11: em1
        port 12: br-int (internal)
        port 13: br-tun (internal)
        port 14: bond1 (internal)
        port 15: tap3dcef867-7d (internal)
        port 16: ha-3fb75f94-c6 (internal)
        port 17: qr-02e5b124-a4 (internal)
        port 18: qg-65ff0210-af (internal)

root@overcloud-controller-2:/home/heat-admin # ./ping.sh | tee full-run.out8-100000
Host 10.0.0.101 worked...
------------------------------ Host 10.0.0.6 - Failed -- Retrying --------------------------------------------
------------ Host 10.0.0.6 - 2nd attempt worked..--------------------
------------------------------ Host 10.0.0.111 - Failed -- Retrying --------------------------------------------
------------ Host 10.0.0.111 - 2nd attempt worked..--------------------
Host 10.0.0.109 worked...
Host 10.0.0.123 worked...
....
Output removed.
....
Host 10.0.0.42 worked...
Host 10.0.0.33 worked...
Host 10.0.0.84 worked...
Host 10.0.0.70 worked...
Host 10.0.0.86 worked...
Host 10.0.0.64 worked...
Host 10.0.0.134 worked...
root@overcloud-controller-2:/home/heat-admin # ovs-dpctl show
system@ovs-system:
        lookups: hit:351378416 missed:17652254 lost:304
        flows: 287
        masks: hit:1668154550 total:7 hit/pkt:4.52
        port 0: ovs-system (internal)
        port 1: br-vlan (internal)
        port 2: br-ex (internal)
        port 3: vlan10 (internal)
        port 4: vlan302 (internal)
        port 5: p2p2
        port 6: p2p1
        port 7: gre_system (gre: df_default=false, ttl=0)
        port 8: vlan301 (internal)
        port 9: vlan303 (internal)
        port 10: vlan304 (internal)
        port 11: em1
        port 12: br-int (internal)
        port 13: br-tun (internal)
        port 14: bond1 (internal)
        port 15: tap3dcef867-7d (internal)
        port 16: ha-3fb75f94-c6 (internal)
        port 17: qr-02e5b124-a4 (internal)
        port 18: qg-65ff0210-af (internal)
The tunnel traffic is going across a LACP bond:

root@overcloud-controller-1:/home/heat-admin # ovs-appctl bond/show bond1
---- bond1 ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 300
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 1501 ms
lacp_status: negotiated
active slave mac: 90:e2:ba:20:e0:cc(p2p1)
Watching OVS in /var/log/messages on one of my controllers:

Oct 6 09:18:12 overcloud-controller-1 kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
(the same message logged 9 times within the same second)
I see the same messages on the compute nodes:

Oct 6 09:20:25 overcloud-compute-0 kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
(the same message logged 10 times within the same second)
https://gist.githubusercontent.com/jtaleric/71859895d0a2cf761dfb/raw/653a4fa762c8f2471f3f8ef1a57bfb1cfdc81e0d/gistfile1.txt
^ Dropwatch output captured while the pings were failing.
Also bumped the gc_thresh values across my environment:

for i in `nova list | grep overcloud | awk '{print $12}' | cut -d= -f 2`; do
  ssh heat-admin@$i sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
  ssh heat-admin@$i sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
  ssh heat-admin@$i sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
done
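For completeness, to make those neighbor-table limits survive a reboot, a sysctl drop-in could look like the fragment below (the file name is hypothetical; the values are the ones from the loop above, raised from the kernel defaults of 128/512/1024):

```shell
# /etc/sysctl.d/91-neigh-gc.conf (hypothetical file name)
# Raised ARP/neighbor table garbage-collection thresholds
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 8192
```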
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/controller-dpctl-flows.out
^ looking at the active flows.

At any given time, we have > 200 flows.
After removing the bond, I no longer see the "openvswitch: ovs-system: deferred action limit reached, drop recirc action" message. Also, the ping test succeeds across all 100 guests.
Nice find, Joe. The question is: do we need the bonded interface in this config?
fbl made a suggestion to try a different bond mode. Go from:

bond_mode=balance-tcp lacp=active other-config:lacp-fallback-ab=true other-config:lacp-time=fast

to:

bond_mode=active-backup

to see if this helps.
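A sketch of making that change with ovs-vsctl (the port name bond1 is the one from this environment; this requires root and is a disruptive change on a live bond, so treat it as illustrative rather than a tested procedure):

```shell
# Switch the OVS bond out of LACP/balance-tcp into active-backup.
ovs-vsctl set port bond1 bond_mode=active-backup lacp=off

# Confirm the new mode took effect:
ovs-appctl bond/show bond1
```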
Active-backup seems to clean up some of these problems.
Joe,

Thanks very much for your testing so far. At this point I think the problem is very likely related to the way the OVS bond implements LACP mode (using recirculation). Active-backup doesn't use that, so it works. The next step is to dig deeper into the OVS bond implementation to see what is going on.

Thanks,
fbl
(In reply to Joe Talerico from comment #14)
> http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/controller-dpctl-flows.out
>
> ^ looking at the active flows.
>
> At any given time, we have > 200 flows.

External link: https://gist.github.com/jtaleric/751086110b0394c94532
Upstream discussion: http://openvswitch.org/pipermail//discuss/2015-October/019163.html
(In reply to Flavio Leitner from comment #22)
> Upstream discussion:
> http://openvswitch.org/pipermail//discuss/2015-October/019163.html

Just checking in on this bug. Were the suggested patches from the upstream discussion ever tested? Do we know whether a fix was ever adopted in the net tree for the OVS kernel module?

Although we currently have a workaround by using Linux kernel bond mode for LACP traffic, there are some situations where we would like the flexibility to use OVS bonds. It would be nice to be able to support LACP bonds with OVS in RDO/OSP in the future, so I'd like to track when the OVS kernel modules are fixed.
(In reply to Dan Sneddon from comment #23) I can't answer the question of whether the suggested patches from the upstream discussion were ever tested, but neither appears to have made it into the upstream kernel (net-next) or upstream OVS (current master).
To give some additional background on this bug:

The original design goal behind using OVS as the bonding mechanism in TripleO was simplicity: since OVS was being used for bridging, adding bonding support directly to the OVS bridge made sense. Unfortunately, in a few large-scale deployments with high traffic, packet loss has been observed in OVS bonds using LACP. To counteract this behavior, support was added to TripleO for Linux kernel-mode bonds, which are not susceptible to packet loss at scale.

When using OVS bonds, an OVS bridge is required. This necessitated the use of a bridge even on the storage nodes, where one wouldn't be required with kernel bonds. Now that we have the option of using kernel-mode bonds, we recommend them in most cases, unless an OVS-specific bonding mode is required (although different packet loss issues have occurred when using balance-slb mode bonds; see https://bugzilla.redhat.com/show_bug.cgi?id=1289962).

Work is underway to provide additional sample templates for Linux kernel-mode bonding, but the following rules apply:

* OVS bridges must be used for interfaces carrying Neutron tenant or provider VLANs, or SNAT/Floating IPs, whether or not bonding is used on a particular interface.
* Bond interfaces use "linux_bond" instead of "ovs_bond" for the type.
* Templates use "bonding_options" instead of "ovs_options" for the bond options.
* Linux kernel bonding options have a different form, such as "mode=802.3ad" to indicate LACP bonding. For more info, see https://www.kernel.org/doc/Documentation/networking/bonding.txt
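To illustrate those rules, a minimal os-net-config style fragment for a kernel-mode LACP bond might look like the sketch below. The interface names and the surrounding template structure are assumptions for illustration, not taken from the attached templates:

```yaml
- type: linux_bond                  # "linux_bond" instead of "ovs_bond"
  name: bond1
  bonding_options: "mode=802.3ad"   # kernel bonding syntax, not ovs_options
  members:
    - type: interface
      name: p2p1                    # hypothetical NIC names
    - type: interface
      name: p2p2
```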
Created attachment 1148801 [details] Linux Bond NIC Templates
Not that I have much of a say, but +1 for kernel-mode bonds.
Hi Jean,

This is something we could try to add to our OVS Bond QE scripts. Two hosts connected using an OVS bond with LACP (balance-tcp) and a VXLAN tunnel; then create 100 netns on each side with valid IP addresses. Then, from a single netns, ping all the other ones. If one fails, you have reproduced this issue.

     Host A                              Host B
      OVS                                 OVS
     /    \                              /    \
netns 0..99  VXLAN                VXLAN  netns 0..99
        port 0 ---------------------- port 0
       Bond LACP                    Bond LACP
        port 1 ---------------------- port 1

netns#0 should be able to ping all other local and remote netns.

Thanks,
fbl
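A rough sketch of one host's side of that setup, shown with two netns for brevity (extend the loop to 0..99 for the real test). The bridge and interface names, subnet, and peer address are all assumptions; this needs root and an installed OVS, so it is a starting point rather than a verified script:

```shell
# One host's side: an OVS bridge with a VXLAN tunnel to the peer,
# plus a veth pair hooking each netns into the bridge.
ovs-vsctl add-br br-test
ovs-vsctl add-port br-test vx0 -- set interface vx0 type=vxlan \
    options:remote_ip=192.0.2.2        # peer host address (assumed)

for i in 0 1; do                        # 0..99 for the full reproduction
    ip netns add ns$i
    ip link add veth$i type veth peer name vp$i
    ip link set veth$i netns ns$i
    ip netns exec ns$i ip addr add 10.1.0.$((10 + i))/24 dev veth$i
    ip netns exec ns$i ip link set veth$i up
    ovs-vsctl add-port br-test vp$i
    ip link set vp$i up
done
```

The LACP bond carrying the tunnel traffic would be configured separately on the physical ports, matching the bond1 setup shown earlier in this bug.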
(In reply to Flavio Leitner from comment #33)
> This is something we could try to add to our OVS Bond QE scripts. Two hosts
> connected using OVS bond LACP (balance-tcp), with vxlan tunnel and then
> create 100 netns on each side with valid IP addresses. Then on a single NS,
> ping all the other ones. If one fail, you have reproduced this issue.
>
> [topology diagram snipped]
>
> netns#0 should be able to ping all other local and remote netns.

Hi Flavio,

Sure, I will give it a try. Some questions:

Using ovs-2.5.0-3 --- non-dpdk ovs?
Using the latest RHEL 7.3 kernel like 447? Or 7.2.z like 327.13.1?

Thanks!
Jean
(In reply to Jean-Tsung Hsiao from comment #34)
> Using ovs-2.5.0-3 --- non-dpdk ovs?

non-DPDK OVS.

> Using latest RHEL7.3 kernel like 447 ? Or 7.2.z like 327.13.1 ?

If there is an issue with 7.2, we might need to fix it in 7.2.z as proposed already in this bz, so the 7.2.z bits seem a good starting point. But we would need to do the same for the RHEL 7.3 bits later on.
(In reply to Flavio Leitner from comment #35)
> If there is an issue with 7.2, we might need to fix in 7.2.z as proposed
> already in this bz, so 7.2.z bits seems a good starting point.
>
> But we would need to do the same for RHEL-7.3 bits later on.

Under kernel 447, with 125 namespaces on each side, all 250 pings passed. So I cannot reproduce the issue (the "no route" error). Will try 327.13.1 next.
(In reply to Flavio Leitner from comment #35)
> If there is an issue with 7.2, we might need to fix in 7.2.z as proposed
> already in this bz, so 7.2.z bits seems a good starting point.
>
> But we would need to do the same for RHEL-7.3 bits later on.

Hi Flavio,

Under 7.2.z there is no such issue either.
FYI, bz#1366681 reproduced the same issue. The analysis is done at https://bugzilla.redhat.com/show_bug.cgi?id=1366681#c51

The fixes were proposed upstream as RFCs:
http://openvswitch.org/pipermail/dev/2016-March/067895.html
http://openvswitch.org/pipermail/dev/2016-March/067794.html
Note that the fix was rejected by David Miller upstream:
http://openvswitch.org/pipermail/dev/2016-March/068046.html

I think it's because of a misunderstanding: the description suggests it's a per-packet limit (and then David's comment would make complete sense), but in fact it's a per-CPU limit, shared by all packets processed on the CPU.
(In reply to Jiri Benc from comment #39)
> Note that the fix was rejected by David Miller upstream:
> http://openvswitch.org/pipermail/dev/2016-March/068046.html
>
> I think it's because of a misunderstanding: the description suggests it's a
> per-packet limit (and then David's comment would make complete sense) but in
> fact, it's a per-CPU limit, shared by all packets processed on the CPU.

Response added to the thread:
http://openvswitch.org/pipermail/dev/2016-August/078528.html
*** Bug 1289825 has been marked as a duplicate of this bug. ***
*** Bug 1366681 has been marked as a duplicate of this bug. ***
The fix for this issue has been accepted upstream and is currently being targeted for RHEL 7.4 (Flavio Leitner is following up with PM regarding whether it should go into a 7.3 fixes stream).

commit f43e6dfb056b58628e43179d8f6b59eae417754d
Author: Lance Richardson <lrichard>
Date:   Mon Sep 12 17:07:23 2016 -0400

    openvswitch: avoid deferred execution of recirc actions

    The ovs kernel data path currently defers the execution of all
    recirc actions until stack utilization is at a minimum.
    This is too limiting for some packet forwarding scenarios due to
    the small size of the deferred action FIFO (10 entries). For
    example, broadcast traffic sent out more than 10 ports with
    recirculation results in packet drops when the deferred action
    FIFO becomes full, as reported here:

        http://openvswitch.org/pipermail/dev/2016-March/067672.html

    Since the current recursion depth is available (it is already tracked
    by the exec_actions_level pcpu variable), we can use it to determine
    whether to execute recirculation actions immediately (safe when
    recursion depth is low) or defer execution until more stack space
    is available.

    With this change, the deferred action fifo size becomes a non-issue
    for currently failing scenarios because it is no longer used when
    there are three or fewer recursions through ovs_execute_actions().

    Suggested-by: Pravin Shelar <pshelar>
What is the status of this defect? It seems that the reference to this defect is no longer in the RHOSP 10 director documentation, while it was still in RHOSP 8 and 9.

Would anybody be able to comment on whether this defect was fixed in some version/kernel/OVS package of RHEL 7.x?
(In reply to Miro Halas from comment #47)
> What is the status of this defect? It seems that the reference to this
> defect is no longer in RHOSP 10 director documentation, while it was still
> in RHOSP 8 and 9.
>
> Would anybody able to comment if this defect was fixed in some version /
> kernel / ovs package of RHEL 7.x?

The fix for this issue will be in RHEL 7.4 and is not currently a candidate for the RHEL 7.3 fixes stream. Here is the BZ for the kernel fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1370643
Could someone confirm this will no longer be an issue with RHEL 7.4, and that we can revert to using balance-tcp with that update? Thanks
This is fixed in RHEL 7.4 and the RHEL 7.3 fixes stream; the fixed kernel versions are in the associated BZs:

https://bugzilla.redhat.com/show_bug.cgi?id=1370643
https://bugzilla.redhat.com/show_bug.cgi?id=1388592

Note that, as far as I know, this has not been verified in an OpenStack environment.
(In reply to Lance Richardson from comment #51)
> This is fixed in RHEL 7.4 and the RHEL 7.3 fixes stream, the fixed kernel
> versions are the associated BZs:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1370643
> https://bugzilla.redhat.com/show_bug.cgi?id=1388592
>
> Note that, as far as I know, this has not been verified in an OpenStack
> environment.

Thanks, but I can't access those bugs, and the last indication was that it wasn't being fixed in RHEL 7.3. Good to know that it has been; perhaps you could make the package version numbers public?

It's probably a hard one to replicate in testing...
(In reply to Christopher Brown from comment #52)
> Thanks but I can't access those bugs and the last indication was that it
> wasn't being fixed in RHEL 7.3. Good to know that it has been - perhaps you
> could make the version numbers for packages public?
>
> Its probably a hard one to replicate in testing...

Upstream commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2679d040412df847d390a3a8f0f224a7c91f7fae

Initial RHEL kernel versions containing the fix:
kernel-3.10.0-519.el7 (7.4 stream)
kernel-3.10.0-514.1.1.el7 (7.3 fixes stream)
Reassigning to Nir, per suggestion by Flavio Leitner. This issue is fixed in RHEL 7.4 and 7.3z kernels, ready to be picked up by layered products.
Fixed in 7.3.z kernel: https://bugzilla.redhat.com/show_bug.cgi?id=1388592