Description of problem:
After spawning 100 guests on the same tenant network and attempting to ping all the guests from a single VM, we begin seeing "no route to host" errors. To work around this, we can go into the VM that we could not ping and ping the dhcp-namespace; then pings return.

Version-Release number of selected component (if applicable):
python-openvswitch-2.3.2-1.git20150730.el7_1.noarch
openstack-neutron-openvswitch-2015.1.1-3.el7ost.noarch
openvswitch-2.3.2-1.git20150730.el7_1.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an OSP environment with VXLAN
2. Launch 100 guests within a tenant network
3. Attempt to ping all the guests from a single guest

Actual results:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/100-run-update-to-ovs-run7.out

Expected results:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/100-run-update-to-ovs-run6.out

Additional info:
After many attempts, we would have a single attempt succeed. There seems to be an issue with ARP, but it is difficult to pinpoint.
Correction to the above: s/ping/ssh. Ping would fail too.
I guess it's ARP table size related? I'd look into that and raise the sysctl limits. Bear in mind that we traverse iptables, so it's not only the OVS ARP table size.
This might be related to an overflow of the CPU's backlog queue. Every time OVS needs to broadcast (especially for MAC learning), it will send copies to all available ports. If there are more broadcasts going on, the CPU backlog queue can overflow (the default is 1000), and that leads to incomplete ARP resolutions for random IP addresses. So, if you try a few times consecutively, you will see the problem move around or even not happen. Please try bumping sysctl netdev_max_backlog to 10000 and see if that helps out.

Thanks,
fbl
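For reference, a minimal sketch of applying that suggestion (the value 10000 is the one suggested above; `sysctl -w` needs root, and per the follow-up comment in this thread OVS caches the value at startup, so it must be restarted afterwards):

```shell
# Raise the per-CPU input backlog queue from the default of 1000.
# Requires root; restart openvswitch afterwards so it picks up the
# new value instead of the one cached at initialization.
sysctl -w net.core.netdev_max_backlog=10000 2>/dev/null \
    || echo "need root (or sysctl) to change this value"

# The running value can be read back without root:
cat /proc/sys/net/core/netdev_max_backlog
```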
I bumped netdev_max_backlog across all machines. This didn't seem to have any impact.

Here is an interesting observation:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/BAGL/ping-with-retry2.out

What is interesting to note is that I ping the router IP, which I would think is totally useless... However, if I did not have that line, the results look like this*:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/BAGL/ping-with-retry.out

*Note I did add more output; this script only printed failures at first.

This script simply uses the namespace to reach each guest:

for i in `nova list | grep ACTIVE | awk '{print $12}' | cut -d "=" -f2`; do
  count=$(ip netns exec qrouter-2503ad84-c7f5-4251-b0c9-6c5ecd7c5dba ping -c 3 $i | grep 'received' | awk -F',' '{print $2}' | awk '{print $1}')
  if [ $count -eq 0 ]; then
    echo "------------------------------ Host $i - Failed -- Retrying --------------------------------------------"
    ping_somewhere_else=$(ip netns exec qrouter-2503ad84-c7f5-4251-b0c9-6c5ecd7c5dba ping -c 3 10.0.0.1)
    count=$(ip netns exec qrouter-2503ad84-c7f5-4251-b0c9-6c5ecd7c5dba ping -c 3 $i | grep 'received' | awk -F',' '{print $2}' | awk '{print $1}')
    if [ $count -eq 0 ]; then
      echo "-------------------- Host $i - Failed 2nd attempt --------------------------"
    else
      echo "------------ Host $i - 2nd attempt worked..--------------------"
    fi
  else
    echo "Host $i worked..."
  fi
done
Also, tunnel type does not matter; we are running with GRE and it still occurs.
You need to bump the sysctl before starting OVS, otherwise it will keep using the cached value read at initialization. Have you restarted OVS, or started OVS after bumping the sysctl value? Also look at the 'ovs-dpctl show' output and see if 'lost' is increasing.

Thanks,
fbl
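As a sketch, the 'lost' counter can be pulled out of that output with awk. The snippet below parses a captured sample line from later in this thread; in practice you would pipe the live `ovs-dpctl show` through the same filter and run it periodically to see whether 'lost' is increasing:

```shell
# Extract the 'lost' count from the datapath stats line, e.g.
#   lookups: hit:349839642 missed:17559558 lost:304
# Live usage would be:  ovs-dpctl show | awk -F'lost:' '/lost:/ {print $2}'
sample='lookups: hit:349839642 missed:17559558 lost:304'
echo "$sample" | awk -F'lost:' '/lost:/ {print $2}'
# -> 304
```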
fbl - This did not help. At first we thought it might have, and I increased it to 100000, yet we are still seeing issues.

Output from the test (note 'lost', and we had failures):

# ovs-dpctl show
system@ovs-system:
        lookups: hit:349839642 missed:17559558 lost:304
        flows: 289
        masks: hit:1654487698 total:8 hit/pkt:4.50
        port 0: ovs-system (internal)
        port 1: br-vlan (internal)
        port 2: br-ex (internal)
        port 3: vlan10 (internal)
        port 4: vlan302 (internal)
        port 5: p2p2
        port 6: p2p1
        port 7: gre_system (gre: df_default=false, ttl=0)
        port 8: vlan301 (internal)
        port 9: vlan303 (internal)
        port 10: vlan304 (internal)
        port 11: em1
        port 12: br-int (internal)
        port 13: br-tun (internal)
        port 14: bond1 (internal)
        port 15: tap3dcef867-7d (internal)
        port 16: ha-3fb75f94-c6 (internal)
        port 17: qr-02e5b124-a4 (internal)
        port 18: qg-65ff0210-af (internal)

root@overcloud-controller-2:/home/heat-admin # ./ping.sh | tee full-run.out8-100000
Host 10.0.0.101 worked...
------------------------------ Host 10.0.0.6 - Failed -- Retrying --------------------------------------------
------------ Host 10.0.0.6 - 2nd attempt worked..--------------------
------------------------------ Host 10.0.0.111 - Failed -- Retrying --------------------------------------------
------------ Host 10.0.0.111 - 2nd attempt worked..--------------------
Host 10.0.0.109 worked...
Host 10.0.0.123 worked...
....
Output removed.
....
Host 10.0.0.42 worked...
Host 10.0.0.33 worked...
Host 10.0.0.84 worked...
Host 10.0.0.70 worked...
Host 10.0.0.86 worked...
Host 10.0.0.64 worked...
Host 10.0.0.134 worked...
root@overcloud-controller-2:/home/heat-admin # ovs-dpctl show
system@ovs-system:
        lookups: hit:351378416 missed:17652254 lost:304
        flows: 287
        masks: hit:1668154550 total:7 hit/pkt:4.52
        port 0: ovs-system (internal)
        port 1: br-vlan (internal)
        port 2: br-ex (internal)
        port 3: vlan10 (internal)
        port 4: vlan302 (internal)
        port 5: p2p2
        port 6: p2p1
        port 7: gre_system (gre: df_default=false, ttl=0)
        port 8: vlan301 (internal)
        port 9: vlan303 (internal)
        port 10: vlan304 (internal)
        port 11: em1
        port 12: br-int (internal)
        port 13: br-tun (internal)
        port 14: bond1 (internal)
        port 15: tap3dcef867-7d (internal)
        port 16: ha-3fb75f94-c6 (internal)
        port 17: qr-02e5b124-a4 (internal)
        port 18: qg-65ff0210-af (internal)
The tunnel traffic is going across a LACP bond:

root@overcloud-controller-1:/home/heat-admin # ovs-appctl bond/show bond1
---- bond1 ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 300
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 1501 ms
lacp_status: negotiated
active slave mac: 90:e2:ba:20:e0:cc(p2p1)
Watching OVS in /var/log/messages on one of my controllers:

Oct 6 09:18:12 overcloud-controller-1 kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
(the same message logged 9 times within the same second)
I see the same messages on the compute nodes:

Oct 6 09:20:25 overcloud-compute-0 kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
(the same message logged 10 times within the same second)
https://gist.githubusercontent.com/jtaleric/71859895d0a2cf761dfb/raw/653a4fa762c8f2471f3f8ef1a57bfb1cfdc81e0d/gistfile1.txt
^ Dropwatch output captured while the pings were failing.
Also bumped the gc_thresh values across my environment:

for i in `nova list | grep overcloud | awk '{print $12}' | cut -d= -f 2`; do
  ssh heat-admin@$i sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
  ssh heat-admin@$i sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
  ssh heat-admin@$i sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
done
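For completeness, to make those neighbor-table limits survive a reboot, a sysctl drop-in could look like the fragment below (the file name is hypothetical; the values are the ones from the loop above, raised from the kernel defaults of 128/512/1024):

```shell
# /etc/sysctl.d/91-neigh-gc.conf (hypothetical file name)
# Raised ARP/neighbor table garbage-collection thresholds
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 8192
```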
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/controller-dpctl-flows.out
^ looking at the active flows.

At any given time, we have > 200 flows.
After removing the bond, I no longer see the "openvswitch: ovs-system: deferred action limit reached, drop recirc action" message. Also, the ping test succeeds across all 100 guests.
Nice find, Joe. The question is: do we need the bonded interface in this config?
fbl made a suggestion to try a different bond mode. Go from:

bond_mode=balance-tcp lacp=active other-config:lacp-fallback-ab=true other-config:lacp-time=fast

to:

bond_mode=active-backup

to see if this helps.
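A sketch of making that change with ovs-vsctl (the port name bond1 is the one from this environment; this requires root and is a disruptive change on a live bond, so treat it as illustrative rather than a tested procedure):

```shell
# Switch the OVS bond out of LACP/balance-tcp into active-backup.
ovs-vsctl set port bond1 bond_mode=active-backup lacp=off

# Confirm the new mode took effect:
ovs-appctl bond/show bond1
```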
Active-backup seems to clean up some of these problems.
Joe,

Thanks very much for your testing so far. At this point I think the problem is very likely related to the way the OVS bond implements LACP mode (using recirculation). Active-backup doesn't use that, so it works. The next step is to dig deeper into the OVS bond implementation to see what is going on.

Thanks,
fbl
(In reply to Joe Talerico from comment #14)
> http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/controller-dpctl-flows.out
>
> ^ looking at the active flows.
>
> At any given time, we have > 200 flows.

External link: https://gist.github.com/jtaleric/751086110b0394c94532
Upstream discussion: http://openvswitch.org/pipermail//discuss/2015-October/019163.html
(In reply to Flavio Leitner from comment #22)
> Upstream discussion:
> http://openvswitch.org/pipermail//discuss/2015-October/019163.html

Just checking in on this bug. Were the suggested patches from the upstream discussion ever tested? Do we know whether a fix was ever adopted in the net tree for the OVS kernel module?

Although we currently have a workaround by using Linux kernel bond mode for LACP traffic, there are some situations where we would like the flexibility to use OVS bonds. It would be nice to be able to support LACP bonds with OVS in RDO/OSP in the future, so I'd like to track when the OVS kernel modules are fixed.
(In reply to Dan Sneddon from comment #23) I can't answer the question of whether the suggested patches from the upstream discussion were ever tested, but neither appears to have made it into the upstream kernel (net-next) or upstream OVS (current master).
To give some additional background on this bug:

The original design goal behind using OVS as the bonding mechanism in TripleO was simplicity: since OVS was being used for bridging, adding bonding support directly to the OVS bridge made sense. Unfortunately, in a few large-scale deployments with high traffic, packet loss has been observed in OVS bonds using LACP. To counteract this behavior, support was added to TripleO for Linux kernel-mode bonds, which are not susceptible to packet loss at scale.

When using OVS bonds, an OVS bridge is required. This necessitated the use of a bridge even on the storage nodes, where one wouldn't be required with kernel bonds. Now that we have the option of using kernel-mode bonds, we recommend them in most cases, unless an OVS-specific bonding mode is required (although different packet loss issues have occurred when using balance-slb mode bonds; see https://bugzilla.redhat.com/show_bug.cgi?id=1289962).

Work is underway to provide additional sample templates for Linux kernel-mode bonding, but the following rules apply:

* OVS bridges must be used for interfaces carrying Neutron tenant or provider VLANs, or SNAT/Floating IPs, whether or not bonding is used on a particular interface.
* Bond interfaces use "linux_bond" instead of "ovs_bond" for the type.
* Templates use "bonding_options" instead of "ovs_options" for the bond options.
* Linux kernel bonding options have a different form, such as "mode=802.3ad" to indicate LACP bonding. For more info, see https://www.kernel.org/doc/Documentation/networking/bonding.txt
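To illustrate those rules, a minimal os-net-config style fragment for a kernel-mode LACP bond might look like the sketch below. The interface names and the surrounding template structure are assumptions for illustration, not taken from the attached templates:

```yaml
- type: linux_bond                  # "linux_bond" instead of "ovs_bond"
  name: bond1
  bonding_options: "mode=802.3ad"   # kernel bonding syntax, not ovs_options
  members:
    - type: interface
      name: p2p1                    # hypothetical NIC names
    - type: interface
      name: p2p2
```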
Created attachment 1148801 [details] Linux Bond NIC Templates
Not that I have much of a say, but +1 for kernel-mode bonds.
Hi Jean,

This is something we could try to add to our OVS Bond QE scripts. Two hosts connected using an OVS bond with LACP (balance-tcp) and a VXLAN tunnel; then create 100 netns on each side with valid IP addresses. Then, from a single netns, ping all the other ones. If one fails, you have reproduced this issue.

     Host A                              Host B
      OVS                                 OVS
     /    \                              /    \
netns 0..99  VXLAN                VXLAN  netns 0..99
        port 0 ---------------------- port 0
       Bond LACP                    Bond LACP
        port 1 ---------------------- port 1

netns#0 should be able to ping all other local and remote netns.

Thanks,
fbl
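A rough sketch of one host's side of that setup, shown with two netns for brevity (extend the loop to 0..99 for the real test). The bridge and interface names, subnet, and peer address are all assumptions; this needs root and an installed OVS, so it is a starting point rather than a verified script:

```shell
# One host's side: an OVS bridge with a VXLAN tunnel to the peer,
# plus a veth pair hooking each netns into the bridge.
ovs-vsctl add-br br-test
ovs-vsctl add-port br-test vx0 -- set interface vx0 type=vxlan \
    options:remote_ip=192.0.2.2        # peer host address (assumed)

for i in 0 1; do                        # 0..99 for the full reproduction
    ip netns add ns$i
    ip link add veth$i type veth peer name vp$i
    ip link set veth$i netns ns$i
    ip netns exec ns$i ip addr add 10.1.0.$((10 + i))/24 dev veth$i
    ip netns exec ns$i ip link set veth$i up
    ovs-vsctl add-port br-test vp$i
    ip link set vp$i up
done
```

The LACP bond carrying the tunnel traffic would be configured separately on the physical ports, matching the bond1 setup shown earlier in this bug.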
(In reply to Flavio Leitner from comment #33)
> This is something we could try to add to our OVS Bond QE scripts. Two hosts
> connected using OVS bond LACP (balance-tcp), with vxlan tunnel and then
> create 100 netns on each side with valid IP addresses. Then on a single NS,
> ping all the other ones. If one fail, you have reproduced this issue.
>
> [topology diagram snipped]
>
> netns#0 should be able to ping all other local and remote netns.

Hi Flavio,

Sure, I will give it a try. Some questions:

Using ovs-2.5.0-3 --- non-dpdk ovs?
Using the latest RHEL 7.3 kernel like 447? Or 7.2.z like 327.13.1?

Thanks!
Jean
(In reply to Jean-Tsung Hsiao from comment #34)
> Using ovs-2.5.0-3 --- non-dpdk ovs?

non-DPDK OVS.

> Using latest RHEL7.3 kernel like 447 ? Or 7.2.z like 327.13.1 ?

If there is an issue with 7.2, we might need to fix it in 7.2.z as proposed already in this bz, so the 7.2.z bits seem a good starting point. But we would need to do the same for the RHEL 7.3 bits later on.
(In reply to Flavio Leitner from comment #35)
> If there is an issue with 7.2, we might need to fix in 7.2.z as proposed
> already in this bz, so 7.2.z bits seems a good starting point.
>
> But we would need to do the same for RHEL-7.3 bits later on.

Under kernel 447, with 125 namespaces on each side, all 250 pings passed. So I cannot reproduce the issue (the "no route" error). Will try 327.13.1 next.
(In reply to Flavio Leitner from comment #35)
> If there is an issue with 7.2, we might need to fix in 7.2.z as proposed
> already in this bz, so 7.2.z bits seems a good starting point.
>
> But we would need to do the same for RHEL-7.3 bits later on.

Hi Flavio,

Under 7.2.z there is no such issue either.
FYI, bz#1366681 reproduced the same issue. The analysis is done at https://bugzilla.redhat.com/show_bug.cgi?id=1366681#c51

The fixes were proposed upstream as RFCs:
http://openvswitch.org/pipermail/dev/2016-March/067895.html
http://openvswitch.org/pipermail/dev/2016-March/067794.html
Note that the fix was rejected by David Miller upstream:
http://openvswitch.org/pipermail/dev/2016-March/068046.html

I think it's because of a misunderstanding: the description suggests it's a per-packet limit (and then David's comment would make complete sense), but in fact it's a per-CPU limit, shared by all packets processed on the CPU.
(In reply to Jiri Benc from comment #39)
> Note that the fix was rejected by David Miller upstream:
> http://openvswitch.org/pipermail/dev/2016-March/068046.html
>
> I think it's because of a misunderstanding: the description suggests it's a
> per-packet limit (and then David's comment would make complete sense) but in
> fact, it's a per-CPU limit, shared by all packets processed on the CPU.

Response added to the thread:
http://openvswitch.org/pipermail/dev/2016-August/078528.html
*** Bug 1289825 has been marked as a duplicate of this bug. ***
*** Bug 1366681 has been marked as a duplicate of this bug. ***
The fix for this issue has been accepted upstream and is currently being targeted for RHEL 7.4 (Flavio Leitner is following up with PM regarding whether it should go into a 7.3 fixes stream).

commit f43e6dfb056b58628e43179d8f6b59eae417754d
Author: Lance Richardson <lrichard>
Date:   Mon Sep 12 17:07:23 2016 -0400

    openvswitch: avoid deferred execution of recirc actions

    The ovs kernel data path currently defers the execution of all
    recirc actions until stack utilization is at a minimum.
    This is too limiting for some packet forwarding scenarios due to
    the small size of the deferred action FIFO (10 entries). For
    example, broadcast traffic sent out more than 10 ports with
    recirculation results in packet drops when the deferred action
    FIFO becomes full, as reported here:

        http://openvswitch.org/pipermail/dev/2016-March/067672.html

    Since the current recursion depth is available (it is already tracked
    by the exec_actions_level pcpu variable), we can use it to determine
    whether to execute recirculation actions immediately (safe when
    recursion depth is low) or defer execution until more stack space
    is available.

    With this change, the deferred action fifo size becomes a non-issue
    for currently failing scenarios because it is no longer used when
    there are three or fewer recursions through ovs_execute_actions().

    Suggested-by: Pravin Shelar <pshelar>
What is the status of this defect? It seems that the reference to this defect is no longer in the RHOSP 10 director documentation, while it was still in RHOSP 8 and 9.

Would anybody be able to comment on whether this defect was fixed in some version/kernel/OVS package of RHEL 7.x?
(In reply to Miro Halas from comment #47)
> What is the status of this defect? It seems that the reference to this
> defect is no longer in RHOSP 10 director documentation, while it was still
> in RHOSP 8 and 9.
>
> Would anybody able to comment if this defect was fixed in some version /
> kernel / ovs package of RHEL 7.x?

The fix for this issue will be in RHEL 7.4 and is not currently a candidate for the RHEL 7.3 fixes stream. Here is the BZ for the kernel fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1370643
Could someone confirm this will no longer be an issue with RHEL 7.4, and that we can revert to using balance-tcp with that update? Thanks
This is fixed in RHEL 7.4 and the RHEL 7.3 fixes stream; the fixed kernel versions are in the associated BZs:

https://bugzilla.redhat.com/show_bug.cgi?id=1370643
https://bugzilla.redhat.com/show_bug.cgi?id=1388592

Note that, as far as I know, this has not been verified in an OpenStack environment.
(In reply to Lance Richardson from comment #51)
> This is fixed in RHEL 7.4 and the RHEL 7.3 fixes stream, the fixed kernel
> versions are the associated BZs:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1370643
> https://bugzilla.redhat.com/show_bug.cgi?id=1388592
>
> Note that, as far as I know, this has not been verified in an OpenStack
> environment.

Thanks, but I can't access those bugs, and the last indication was that it wasn't being fixed in RHEL 7.3. Good to know that it has been; perhaps you could make the package version numbers public?

It's probably a hard one to replicate in testing...
(In reply to Christopher Brown from comment #52)
> Thanks but I can't access those bugs and the last indication was that it
> wasn't being fixed in RHEL 7.3. Good to know that it has been - perhaps you
> could make the version numbers for packages public?
>
> Its probably a hard one to replicate in testing...

Upstream commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2679d040412df847d390a3a8f0f224a7c91f7fae

Initial RHEL kernel versions containing the fix:
kernel-3.10.0-519.el7 (7.4 stream)
kernel-3.10.0-514.1.1.el7 (7.3 fixes stream)
Reassigning to Nir, per suggestion by Flavio Leitner. This issue is fixed in RHEL 7.4 and 7.3z kernels, ready to be picked up by layered products.
Fixed in 7.3.z kernel: https://bugzilla.redhat.com/show_bug.cgi?id=1388592