Bug 1558336 - high ovs-vswitchd CPU usage and VRRP over VXLAN tunnel causing qrouter fail-over
Summary: high ovs-vswitchd CPU usage and VRRP over VXLAN tunnel causing qrouter fail-over
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: zstream
Target Release: 12.0 (Pike)
Assignee: Eelco Chaudron
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On: 1575696 1575706 1579300 1579352 1579400
Blocks:
 
Reported: 2018-03-20 04:42 UTC by Robin Cernin
Modified: 2022-08-09 11:11 UTC (History)
CC List: 31 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1589031 1589058 (view as bug list)
Environment:
Last Closed: 2018-07-02 11:03:53 UTC
Target Upstream Version:
Embargoed:


Attachments
Ovs scripts to simulate the environment (850 bytes, application/x-bzip)
2018-04-27 17:19 UTC, Jiri Benc


Links
Red Hat Issue Tracker OSP-8885 (Last Updated: 2022-08-09 11:11:38 UTC)

Description Robin Cernin 2018-03-20 04:42:24 UTC
Description of problem:

The CPU utilization of ovs-vswitchd is high without DPDK enabled:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1512 root      10 -10 4352840 793864  12008 R  1101  0.3  15810:26 ovs-vswitchd

At the same time we were observing failures to send packets (ICMP) over the VXLAN tunnel; we think this might be related to the high CPU usage.

When ICMP packets failed to transfer over the VXLAN tunnel, VRRP would fail as well and cause a fail-over of the qrouter to another controller; this can also end up in a split-brain situation.

While capturing the perf data, we noticed the "Check IO/CPU overload!" warning:

# perf record -p `pidof ovs-vswitchd` -g --call-graph dwarf sleep 60
[ perf record: Woken up 37540 times to write data ]
Warning:
Processed 2632098 events and lost 59 chunks!

Check IO/CPU overload!

Warning:
377 out of order events recorded.
[ perf record: Captured and wrote 20047.977 MB perf.data (2485324 samples) ]
[1]+  Done                    pidstat -u -t -p `pidof ovs-vswitchd`,`pidof ovsdb-server` 5 12 > ovs_pidstat.txt

The files are attached to this Bugzilla.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 16 anil venkata 2018-03-23 09:11:42 UTC
From log /cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net/
I see the L2 agent is continuously restarting from "2018-03-16 03:18" till "2018-03-16 06:50" because of a TimeoutException while adding the br-int bridge.

After that, "ofctl requests" are timing out, resulting in "Agent out of sync with plugin" and the ports being synced again. The L3 agent also has similar timeout issues.

During this time, ovs-vswitchd is frequently dropping the connection (see the message below from the ovs-vswitchd log) as there is no response to the inactivity probe.

2018-03-16T09:28:10.372Z|47510|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting

Can we increase this inactivity probe to a larger value than 5 seconds and try?

Comment 18 Jiri Benc 2018-03-23 11:44:59 UTC
I assume /cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net/ is the sosreport from a machine that exhibits the problem?

$ wc -l < sos_commands/openvswitch/ovs-ofctl_dump-flows_br-tun
5683

$ wc -l < sos_commands/openvswitch/ovs-dpctl_dump-flows_-m
14328

and in the ovs log there are tons of stalls on handler and revalidator threads.

Could it be just a packet storm that the machine is not able to cope with?

Comment 19 Jiri Benc 2018-03-23 11:57:07 UTC
The kernel log contains a few entries such as this:

[  683.252404] IPv6: tap4c1d6da3-e0: IPv6 duplicate address fc00::172:24:101:3 detected!

Those interfaces are still in the ovs bridge. Seems there's a loop somewhere?

Comment 20 Jiri Benc 2018-03-23 12:12:46 UTC
(In reply to Jiri Benc from comment #18)
> $ wc -l < sos_commands/openvswitch/ovs-dpctl_dump-flows_-m
> 14328

Out of that, 351 are non-tunnel flows; the rest are from tunnels. There are 252 VXLAN tunnels configured and the flows refer to 261 different tunnel IDs. That's not unreasonable.

Numbers of flows for individual tunnel ids:
$ sed -n 's/.*tun_id=\([0-9a-fx]*\).*/\1/p' sos_commands/openvswitch/ovs-dpctl_dump-flows_-m | sort | uniq -c

This is all probably okay.

Comment 21 Brian Haley 2018-03-23 13:36:02 UTC
Regarding the message Jiri noticed:

[  683.252404] IPv6: tap4c1d6da3-e0: IPv6 duplicate address fc00::172:24:101:3 detected!

That looks suspicious, since it looks like someone took the beginning of the ULA range (fc00::/7) and appended an IPv4 address, 172.24.101.3. Neutron wouldn't do that normally, as it always calculates IPv6 addresses based on the port MAC; the address would have had to be manually assigned to the port.

Might be unrelated unless it is showing there is a loop somewhere.

Comment 22 Matteo Croce 2018-03-23 15:05:56 UTC
Can we get the file descriptor count ovs-vswitchd is using?
What does this command say?

ls "/proc/$(pidof ovs-vswitchd)/fd" | wc -l

Comment 23 Jiri Benc 2018-03-23 17:51:19 UTC
(In reply to Matteo Croce from comment #22)
> Can we get the file descriptor count ovs-vswitchd is using?
> What does this command say?
> 
> ls "/proc/$(pidof ovs-vswitchd)/fd" | wc -l

I know what you mean, but nope, that's not it. I checked that as one of the first things. The daemon is below the max fd limit. There's lsof output in the sosreport in case you want to take a look yourself. Also, the messages in the ovs-vswitchd log don't match the excessive fd problem.

$ grep '^ovs.*1484' process/lsof_-b_M_-n_-l | wc -l
53240
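
For reference, the fd-count-versus-limit check can also be scripted; a minimal sketch, assuming a single ovs-vswitchd process and the standard /proc/<pid>/limits layout:

------
#!/usr/bin/python
# Sketch: rule out fd exhaustion by comparing ovs-vswitchd's open fd count
# with its soft "Max open files" limit from /proc.
import os
import subprocess

pid = subprocess.check_output(["pidof", "ovs-vswitchd"]).split()[0].decode()
nfds = len(os.listdir("/proc/%s/fd" % pid))
with open("/proc/%s/limits" % pid) as f:
    soft = next(line.split()[3] for line in f if line.startswith("Max open files"))
print("ovs-vswitchd: %d open fds, soft limit %s" % (nfds, soft))
------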

Comment 26 Jiri Benc 2018-03-28 14:02:16 UTC
Any progress with the requested data from comment 24? Note that due to Easter holidays, I'm off until Monday.

And yes, it looks very similar to bug 1509994:

1. From the sosreport: out of 14328 datapath flows, 13697 are for arps.

2. From the perf data: the threads are mostly busy doing packet processing and waiting on a spinlock in the kernel. A typical thread looks like this:

# Overhead       Samples  Command         Shared Object        Symbol                                    
# ........  ............  ..............  ...................  ..........................................
#
    26.42%           932  revalidator435  ovs-vswitchd         [.] xbundle_includes_vlan
     7.44%           270  revalidator435  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
     7.30%           257  revalidator435  ovs-vswitchd         [.] xlate_normal_flood
     2.21%            77  revalidator435  [kernel.kallsyms]    [k] pcpu_alloc_area
     2.15%            77  revalidator435  ovs-vswitchd         [.] compose_output_action__
     1.71%            62  revalidator435  [kernel.kallsyms]    [k] try_to_wake_up
     1.69%            59  revalidator435  libc-2.17.so         [.] _int_free
     1.68%            59  revalidator435  ovs-vswitchd         [.] classifier_lookup__
     1.45%            53  revalidator435  ovs-vswitchd         [.] mbundle_lookup
     1.43%            51  revalidator435  [kernel.kallsyms]    [k] osq_lock
     1.28%            45  revalidator435  [kernel.kallsyms]    [k] mutex_spin_on_owner
     1.14%            41  revalidator435  libpthread-2.17.so   [.] pthread_mutex_unlock
     1.07%            38  revalidator435  ovs-vswitchd         [.] parse_flow_nlattrs
     1.02%            37  revalidator435  libpthread-2.17.so   [.] pthread_mutex_lock
     1.02%            37  revalidator435  libc-2.17.so         [.] malloc_consolidate

The high CPU usage in xbundle_includes_vlan is attributed to bundle->trunks bitmap access (in xbundle_trunks_vlan). Not sure how reliable that report is, though.

Comment 27 Gyanendra Kumar 2018-03-29 05:08:18 UTC
We are still waiting for an update from the customer to apply the changes and provide the requested data.

Comment 28 anil venkata 2018-03-29 07:38:50 UTC
Did you notice any pattern in the ARP requests? I mean, are the ARP requests for HA router ports?

Comment 33 Jiri Benc 2018-04-06 12:07:35 UTC
I can't log in to collab-shell; I'm getting "Permission denied". Not sure what happened, as I was able to log in previously. There's no contact person stated anywhere, so I don't know who to contact to fix this. Any idea?

Comment 43 Jiri Benc 2018-04-18 11:41:22 UTC
That said, I don't think I can help further. The issue does not seem to be in Open vSwitch.

Comment 53 Miguel Angel Ajo 2018-04-23 12:14:25 UTC
Asking Ryan Ohara for some help on why we could be seeing "VRRP_Instance(VR_2) Received advert with lower priority 50, ours 50, forcing new election" and all the GARPs repeating.

Comment 59 Jiri Benc 2018-04-25 13:59:56 UTC
Data parsed from all the sosreports I could find:

/cases/02056811/sosreport-20180316-070825/esclos042037.eecloud.nsn-net.net
        hostname: esclos042037.eecloud.nsn-net.net
        date: Fri Mar 16 07:07:47 EET 2018
        ovs load: 947 | 15 threads at >40
        vxlans: 272
        qr ifaces: 309
        qg ifaces: 277
        ha ifaces: 299
        arp dp flows: 89% | 7406 / 8254

/cases/02056811/sosreport-20180316-070825/esclos042073.eecloud.nsn-net.net
        hostname: esclos042073.eecloud.nsn-net.net
        date: Fri Mar 16 07:08:25 EET 2018
        ovs load: 733 | 15 threads at >40
        vxlans: 272
        qr ifaces: 310
        qg ifaces: 279
        ha ifaces: 300
        arp dp flows: 92% | 15002 / 16274

/cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net
        hostname: esclos042001.eecloud.nsn-net.net
        date: Fri Mar 16 12:52:53 EET 2018
        ovs load: 893 | 16 threads at >40
        vxlans: 252
        qr ifaces: 306
        qg ifaces: 274
        ha ifaces: 295
        arp dp flows: 95% | 13697 / 14328

/cases/02056811/sosreport-20180329-121944
        hostname: esclos999001.eecloud.nsn-net.net
        date: Thu Mar 29 12:22:59 EEST 2018
        ovs load: 23.2 | 0 threads at >40
        vxlans: 19
        qr ifaces: 229
        qg ifaces: 230
        ha ifaces: 285
        arp dp flows: 6% | 26 / 372
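
For illustration, the "arp dp flows" numbers above can be recomputed from a sosreport's datapath flow dump; a rough sketch, assuming ARP datapath flows are the ones matching eth_type(0x0806) and using the dump file name from comment 18:

------
#!/usr/bin/python
# Sketch: count ARP datapath flows in the output of "ovs-dpctl dump-flows -m".
import sys

total = 0
arps = 0
# e.g. sos_commands/openvswitch/ovs-dpctl_dump-flows_-m
with open(sys.argv[1]) as f:
    for line in f:
        if not line.strip():
            continue
        total += 1
        if "eth_type(0x0806)" in line:
            arps += 1
print("arp dp flows: %d%% | %d / %d" % (100 * arps / total, arps, total))
------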

Comment 60 Jiri Benc 2018-04-26 16:33:30 UTC
Another interesting piece of data is the ovs-appctl upcall/show output:

/cases/02056811/sosreport-20180316-070825/esclos042037.eecloud.nsn-net.net
system@ovs-system:
        flows         : (current 9935) (avg 9857) (max 17348) (limit 200000)
        dump duration : 87ms
        ufid enabled : true

        435: (keys 693)
        436: (keys 683)
        437: (keys 697)
        438: (keys 673)
        439: (keys 652)
        440: (keys 664)
        441: (keys 670)
        442: (keys 672)
        443: (keys 673)
        444: (keys 687)
        445: (keys 670)
        446: (keys 668)
        447: (keys 638)
        448: (keys 663)
        449: (keys 682)

/cases/02056811/sosreport-20180316-070825/esclos042073.eecloud.nsn-net.net
system@ovs-system:
        flows         : (current 16301) (avg 16330) (max 17278) (limit 200000)
        dump duration : 81ms
        ufid enabled : true
                                                                                                                      
        435: (keys 1090)                                                                                              
        436: (keys 1119)                                                                                              
        437: (keys 1121)                                                                                              
        438: (keys 1089)                                                                                              
        439: (keys 1063)                                                                                              
        440: (keys 1101)                                                                                              
        441: (keys 1067)                                                                                              
        442: (keys 1085)                                                                                              
        443: (keys 1070)                                                                                              
        444: (keys 1136)                                                                                              
        445: (keys 1103)                                                                                              
        446: (keys 1041)                                                                                              
        447: (keys 1047)                                                                                              
        448: (keys 1146)                                                                                              
        449: (keys 1053)                                                                                              
                                                                                                                      
/cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net                                            
system@ovs-system:                                                                                                    
        flows         : (current 14419) (avg 14418) (max 14790) (limit 200000)                                        
        dump duration : 49ms                                                                                          
        ufid enabled : true                                                                                           
                                                                                                                      
        435: (keys 934)                                                                                               
        436: (keys 978)                                                                                               
        437: (keys 967)                                                                                               
        438: (keys 985)                                                                                               
        439: (keys 934)                                                                                               
        440: (keys 980)                                                                                               
        441: (keys 954)                                                                                               
        442: (keys 948)                                                                                               
        443: (keys 927)                                                                                               
        444: (keys 970)                                                                                               
        445: (keys 980)                                                                                               
        446: (keys 962)
        447: (keys 1000)
        448: (keys 936)
        449: (keys 966)

/cases/02056811/sosreport-20180329-121944
system@ovs-system:
        flows         : (current 366) (avg 369) (max 1533) (limit 5000)
        dump duration : 7ms
        ufid enabled : true

        181: (keys 38)
        182: (keys 33)
        183: (keys 21)
        184: (keys 28)
        185: (keys 34)
        186: (keys 25)
        187: (keys 23)
        188: (keys 34)
        189: (keys 21)
        190: (keys 24)
        191: (keys 23)
        192: (keys 33)
        193: (keys 32)

Comment 61 Jiri Benc 2018-04-26 16:42:12 UTC
What I think we're seeing is something (what that 'something' is remains unclear) causing the dp flows to be continuously revalidated. The revalidator threads then not only do a lot of work, as there are many complex flows (refer to the long flows in comment 46), but the xlating of these flows on different CPUs also causes contention on the trunk bitmaps (cache ping-pong; refer to comment 26, where it is clearly visible).

Interestingly, according to Miguel, those trunk ports that cause the costly cache ping-pong should not even be there; they were not tagged when they should have been.

Comment 64 Jiri Benc 2018-04-27 17:19:18 UTC
Created attachment 1427741 [details]
Ovs scripts to simulate the environment

I've been trying to reproduce the high CPU usage with no success.

Attached are the scripts I've been using. "setup" creates a bridge with 500 ports (note that it takes ages; see also below); arpsend.py bombs the bridge with random ARPs. Alternatively, arpsend2.py sends 10000 ARPs repeatedly.
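
The attached scripts are not reproduced inline here; as a rough idea of what arpsend.py does, a sketch along these lines (the interface name ovs0 and the exact packet layout are assumptions; the real attachment may differ):

------
#!/usr/bin/python
# Sketch of an arpsend.py-style flooder: bombard the bridge with ARP requests
# from random source MACs and IPs.
from scapy.all import ARP, Ether, RandIP, RandMAC, conf

s = conf.L2socket(iface="ovs0")
while True:
    mac = str(RandMAC())
    p = Ether(src=mac, dst="ff:ff:ff:ff:ff:ff") / \
        ARP(hwsrc=mac, psrc=str(RandIP()), pdst=str(RandIP()))
    s.send(p)
------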

The CPUs are only mildly loaded (< 6% for the revalidator threads).

To force frequent revalidation, I tried changing the bridge config in a tight loop (setting a port as an access port and changing its tag, adding/removing a port) and noticed that it takes many seconds for ovs-vswitchd to process each request. This, of course, could not have any effect on the revalidator threads.

But it's interesting: even commands such as "ovs-appctl upcall/show" sometimes take many seconds to complete. From strace, ovs-appctl is waiting for a reply on a unix socket from ovs-vswitchd. These stalls don't happen every time, but they happen even when there's no traffic flowing.

Comment 68 Jiri Benc 2018-05-31 18:03:36 UTC
I managed to reproduce and analyze this.

First, the reproduction steps. It's actually surprisingly simple once you explore all the blind alleys.

Create an ovs bridge:

------
ovs-vsctl add-br ovs0
ip l s ovs0 up
------

Save this to a file named "reproducer.py":

------
#!/usr/bin/python
import sys
from scapy.all import *

# Pre-generate N (MAC, IP) pairs; N is taken from the command line.
data = [(str(RandMAC()), str(RandIP())) for i in range(int(sys.argv[1]))]

# Replay the same packets forever; every packet carries a different MAC,
# which keeps churning the bridge's MAC learning table.
s = conf.L2socket(iface="ovs0")
while True:
    for mac, ip in data:
        p = Ether(src=mac, dst=mac)/IP(src=ip, dst=ip)
        s.send(p)
------

Run the reproducer:

./reproducer.py 5000

Comment 69 Jiri Benc 2018-05-31 18:26:26 UTC
The problem is how flow revalidation works in ovs. There are several 'revalidator' threads launched. They should normally sleep (modulo waking every 0.5 seconds just to do nothing), and they wake if anything of interest happens (udpif_revalidator => poll_block). On every wake-up, each revalidator thread checks whether flow revalidation is needed and, if it is, it does the revalidation.

The revalidation is very costly with a high number of flows. I also suspect there's a lot of contention between the revalidator threads.

The flow revalidation is triggered by many things. What is of interest for us is that any eviction of a MAC learning table entry triggers revalidation.

The reproducer script repeatedly sends the same 5000 packets, each of them with a different MAC address. This causes constant overflows of the MAC learning table and constant revalidation. The revalidator threads are immediately woken up and are busy-looping on the revalidation.

This is exactly the pattern from the customers' data: there are 16000+ flows and the packet capture shows that the packets are repeating every second.

A quick fix is to increase the MAC learning table size:

ovs-vsctl set bridge <bridge> other-config:mac-table-size=50000

This should lower the CPU usage substantially; allow a few seconds for things to settle down.
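
To check how close a bridge is to the limit, the number of learned MACs can be compared with the configured table size; a rough sketch (the bridge name is an assumption, and 2048 is the default when mac-table-size is not set):

------
#!/usr/bin/python
# Sketch: compare the number of MACs currently learned on a bridge with the
# configured mac-table-size (defaults to 2048 when the key is not set).
import subprocess

bridge = "br-int"   # assumption: adjust to the bridge under investigation
fdb = subprocess.check_output(["ovs-appctl", "fdb/show", bridge]).decode()
learned = len(fdb.splitlines()) - 1   # first line of fdb/show is the header
try:
    size = subprocess.check_output(
        ["ovs-vsctl", "get", "bridge", bridge,
         "other-config:mac-table-size"]).decode().strip().strip('"')
except subprocess.CalledProcessError:
    size = "2048 (default)"
print("%s: %d learned MACs, mac-table-size %s" % (bridge, learned, size))
------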

Comment 70 Jiri Benc 2018-05-31 18:31:34 UTC
Now, what ovs does with the revalidation is not really smart. There's no reason for the costly revalidation just because the MAC learning table overflowed; we should be able to be much more targeted. Also, an analysis of the contention between the revalidator threads would be a useful thing to do.

For now, I suggest deploying the workaround with an increased mac-table-size, and we should look into optimizing the revalidation long term.

Comment 78 Miguel Angel Ajo 2018-06-08 08:42:56 UTC
(In reply to Jiri Benc from comment #69)

> A quick fix is to increase the MAC learning table size:
> 
> ovs-vsctl set bridge <bridge> other-config:mac-table-size=50000
> 
> This should lower the CPU usage substantially; allow a few seconds for
> things to settle down.



I think this is something we should implement in both the OpenStack neutron-openvswitch-agent and ovn-controller. Although ovn-controller doesn't use NORMAL rules (so no huge flows will be created in the datapath, I guess), I understand it should be safer to handle a higher number.

A setting for both, with a high default, is probably the best option.

Comment 84 Eelco Chaudron 2018-06-20 12:20:36 UTC
(In reply to Jiri Benc from comment #69)

> The reproducer script repeatedly sends the same 5000 packets, all of them
> with a different MAC address. This causes constant overflows of the MAC
> learning table and constant revalidation. The revalidator threads are being
> immediately woken up and are busy looping the revalidation.

It was odd that the reproducer needed roughly 5000 packets to trigger the issue, vs. the 2048+ you would expect.

The issue is with the reproducer script: the random MAC function also generates multicast MAC addresses, which are not accepted :) About half of the packets generated were multicast...

So if you change the generation of the MACs as follows:

data = [(str("00" + str(RandMAC())[2:]), str(RandIP())) for i in range(int(sys.argv[1]))]

You will see the issue with 2100 packets.
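
For what it's worth, an equivalent fix is to clear the multicast (I/G) bit of the generated MAC instead of hard-coding the first octet; a sketch of a drop-in replacement for the data line in reproducer.py:

------
#!/usr/bin/python
# Sketch: force every generated MAC to be unicast by clearing the I/G bit
# (the least significant bit of the first octet) instead of fixing it to 00.
import sys
from scapy.all import RandIP, RandMAC

def unicast(mac):
    octets = mac.split(":")
    octets[0] = "%02x" % (int(octets[0], 16) & 0xfe)
    return ":".join(octets)

data = [(unicast(str(RandMAC())), str(RandIP())) for i in range(int(sys.argv[1]))]
------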

The next step will be to add some instrumentation: counters that make it easy to identify issues like this.

After that, I will try to see how/if revalidation can be further optimized for MAC table changes.

Comment 85 Scott Lewis 2018-06-22 12:35:29 UTC
OSP11 is now retired; see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828

Comment 86 Eelco Chaudron 2018-06-22 12:48:19 UTC
Re-opening this BZ as we are still working on this, and it is a problem in all OSP versions. Bumped the reported version to 12 for now.

Comment 87 Eelco Chaudron 2018-06-25 11:34:45 UTC
Sent patch upstream to add additional troubleshooting counters:

https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348546.html

Comment 88 Eelco Chaudron 2018-06-27 11:28:11 UTC
Sent patch upstream to see if people would accept an increase in the default mac table size:

https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348683.html

Comment 89 Eelco Chaudron 2018-07-02 11:03:53 UTC
After some more research, I'll close this BZ as will-not-fix, as the root cause for this issue was the FDB table being configured too small.

A set of separate BZs has been created to make the FDB configuration available in OSP.

Changing the default FDB size to a 50K+ value is not a realistic option: first, because it's already configurable (and this is a deployment-specific adjustment), and second, because it might open up OVS to memory-depletion attacks.

We have submitted an upstream patch to make troubleshooting these kinds of rapid FDB replacements easier. See comment #87.

I also looked at making the revalidation process faster for this specific issue. However, it's not a straightforward change, and on top of that, it will not result in less CPU usage: if we make the process "faster", it will just result in more FDB updates per second, not in less CPU usage in the revalidator threads.

Comment 90 Miguel Angel Ajo 2018-07-03 12:20:10 UTC
@eelco @jbenc, everybody involved, great job. Thank you.

