Description of problem:

The CPU utilization of ovs-vswitchd is high without DPDK enabled:

  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 1512 root      10 -10 4352840 793864 12008 R  1101  0.3  15810:26 ovs-vswitchd

At the same time we were observing failures to send packets (ICMP) over the VXLAN tunnel; we think this might be related to the high CPU usage. When ICMP packets failed to transfer over the VXLAN tunnel, VRRP would fail too and cause fail-over of the qrouter to another controller; this can also end up in a split-brain situation.

While capturing the perf data, we noticed a "Check IO/CPU overload" warning:

# perf record -p `pidof ovs-vswitchd` -g --call-graph dwarf sleep 60
[ perf record: Woken up 37540 times to write data ]
Warning:
Processed 2632098 events and lost 59 chunks!

Check IO/CPU overload!

Warning:
377 out of order events recorded.
[ perf record: Captured and wrote 20047.977 MB perf.data (2485324 samples) ]
[1]+  Done                    pidstat -u -t -p `pidof ovs-vswitchd`,`pidof ovsdb-server` 5 12 > ovs_pidstat.txt

The files are attached to this Bugzilla.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
From log /cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net/ I see the L2 agent is continuously restarting from "2018-03-16 03:18" till "2018-03-16 06:50", as there is a TimeoutException while adding the br-int bridge. After that, "ofctl requests" time out, resulting in "Agent out of sync with plugin" and syncing the ports again. The L3 agent also has similar timeout issues.

During this time, ovs-vswitchd is frequently dropping the connection (message below from the ovs-vswitchd log) as there is no response to the inactivity probe:

2018-03-16T09:28:10.372Z|47510|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting

Can we increase this inactivity probe timeout to a larger value than 5 seconds and try?
I assume /cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net/ is the sosreport from a machine that exhibits the problem?

$ wc -l < sos_commands/openvswitch/ovs-ofctl_dump-flows_br-tun
5683
$ wc -l < sos_commands/openvswitch/ovs-dpctl_dump-flows_-m
14328

and in the ovs log there are tons of stalls on the handler and revalidator threads. Could it be just a packet storm that the machine is not able to cope with?
The kernel log contains a few entries such as this:

[  683.252404] IPv6: tap4c1d6da3-e0: IPv6 duplicate address fc00::172:24:101:3 detected!

Those interfaces are still in the ovs bridge. Seems there's a loop somewhere?
(In reply to Jiri Benc from comment #18)
> $ wc -l < sos_commands/openvswitch/ovs-dpctl_dump-flows_-m
> 14328

Out of that, 351 are non-tunnel flows; the rest is from tunnels. There are 252 vxlan tunnels configured and the flows refer to 261 different tunnel ids. That's not unreasonable.

Numbers of flows for individual tunnel ids:

$ sed -n 's/.*tun_id=\([0-9a-fx]*\).*/\1/p' sos_commands/openvswitch/ovs-dpctl_dump-flows_-m | sort | uniq -c

This is all probably okay.
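For cross-checking, the per-tunnel-id counting done by the sed/sort/uniq pipeline above can be sketched in Python. This is a hypothetical helper, not part of the sosreport tooling; the sample lines only mimic the general shape of ovs-dpctl dump-flows -m output:

```python
import re
from collections import Counter

def tun_id_histogram(flow_lines):
    """Count datapath flows per tun_id, like the sed | sort | uniq -c pipeline."""
    hist = Counter()
    for line in flow_lines:
        m = re.search(r"tun_id=([0-9a-fx]+)", line)
        if m:
            hist[m.group(1)] += 1
    return hist

# Hypothetical sample lines in the rough style of a datapath flow dump:
sample = [
    "tunnel(tun_id=0x40,src=10.0.0.1,dst=10.0.0.2),eth_type(0x0800), ...",
    "tunnel(tun_id=0x40,src=10.0.0.3,dst=10.0.0.2),eth_type(0x0806), ...",
    "tunnel(tun_id=0x41,src=10.0.0.1,dst=10.0.0.2),eth_type(0x0800), ...",
]
print(tun_id_histogram(sample))  # Counter({'0x40': 2, '0x41': 1})
```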
Regarding the message Jiri noticed:

[  683.252404] IPv6: tap4c1d6da3-e0: IPv6 duplicate address fc00::172:24:101:3 detected!

That looks suspicious: it looks like someone took the beginning of the ULA range (fc00::/7) and appended an IPv4 address, 172.24.101.3. Neutron wouldn't do that normally, as it always calculates IPv6 addresses based on the port MAC; the address would have had to be manually assigned to the port. It might be unrelated, unless it is showing there is a loop somewhere.
Can we get the count of file descriptors ovs-vswitchd is using? What does this command say?

ls "/proc/$(pidof ovs-vswitchd)/fd" | wc -l
(In reply to Matteo Croce from comment #22)
> Can we get the file descriptors count ovs-vswitchd is using?
> What this command says?
>
> ls "/proc/$(pidof ovs-vswitchd)/fd" |wc -l

I know what you mean, but nope, that's not it. I checked that as one of the first things. The daemon is below the max fd limit. There's lsof output in the sosreport in case you want to take a look yourself. Also, the messages in the ovs-vswitchd log don't match the excessive fd problem.

$ grep '^ovs.*1484' process/lsof_-b_M_-n_-l | wc -l
53240
Any progress with the requested data from comment 24?

Note that due to Easter holidays, I'm off until Monday.

And yes, it looks very similar to bug 1509994:

1. From the sosreport: out of 14328 datapath flows, 13697 are for arps.

2. From the perf data: the threads are mostly busy doing packet processing and waiting on a spinlock in the kernel. A typical thread looks like this:

# Overhead  Samples  Command         Shared Object        Symbol
# ........  .......  ..............  ...................  ..........................................
  26.42%        932  revalidator435  ovs-vswitchd         [.] xbundle_includes_vlan
   7.44%        270  revalidator435  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
   7.30%        257  revalidator435  ovs-vswitchd         [.] xlate_normal_flood
   2.21%         77  revalidator435  [kernel.kallsyms]    [k] pcpu_alloc_area
   2.15%         77  revalidator435  ovs-vswitchd         [.] compose_output_action__
   1.71%         62  revalidator435  [kernel.kallsyms]    [k] try_to_wake_up
   1.69%         59  revalidator435  libc-2.17.so         [.] _int_free
   1.68%         59  revalidator435  ovs-vswitchd         [.] classifier_lookup__
   1.45%         53  revalidator435  ovs-vswitchd         [.] mbundle_lookup
   1.43%         51  revalidator435  [kernel.kallsyms]    [k] osq_lock
   1.28%         45  revalidator435  [kernel.kallsyms]    [k] mutex_spin_on_owner
   1.14%         41  revalidator435  libpthread-2.17.so   [.] pthread_mutex_unlock
   1.07%         38  revalidator435  ovs-vswitchd         [.] parse_flow_nlattrs
   1.02%         37  revalidator435  libpthread-2.17.so   [.] pthread_mutex_lock
   1.02%         37  revalidator435  libc-2.17.so         [.] malloc_consolidate

The high CPU usage in xbundle_includes_vlan is attributed to the bundle->trunks bitmap access (in xbundle_trunks_vlan). Not sure how reliable that report is, though.
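The 13697-out-of-14328 ARP ratio above can be recomputed from a flow dump with a small sketch. This is a hypothetical helper (not part of the tooling used here); it assumes ARP datapath flows carry eth_type(0x0806) in the flow key, and the sample lines only mimic the shape of a real dump:

```python
def arp_flow_stats(flow_lines):
    """Return (arp_flows, total_flows) for a datapath flow dump."""
    flows = [l for l in flow_lines if l.strip()]
    arps = [l for l in flows if "eth_type(0x0806)" in l]
    return len(arps), len(flows)

# Hypothetical sample lines in the rough style of a datapath flow dump:
sample = [
    "recirc_id(0),eth_type(0x0806),arp(sip=10.0.0.1), actions:2",
    "recirc_id(0),eth_type(0x0806),arp(sip=10.0.0.2), actions:3",
    "recirc_id(0),eth_type(0x0800),ipv4(proto=1), actions:4",
]
print(arp_flow_stats(sample))  # (2, 3)
```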
We are still waiting for an update from the customer to apply the changes and provide the requested data.
Did you notice any pattern in the ARP requests? I mean, are the ARP requests for HA router ports?
I can't log in to collab-shell. Getting "Permission denied". Not sure what happened, I was able to log in previously. There's no contact person stated anywhere. I don't know who to contact to fix that. Any idea?
That said, I don't think I can help further. The issue does not seem to be in Open vSwitch.
Asking Ryan Ohara for some help about why we could be seeing the "VRRP_Instance(VR_2) Received advert with lower priority 50, ours 50, forcing new election" messages and all the repeating garps.
Data parsed from all the sosreports I could find:

/cases/02056811/sosreport-20180316-070825/esclos042037.eecloud.nsn-net.net
  hostname: esclos042037.eecloud.nsn-net.net
  date: Fri Mar 16 07:07:47 EET 2018
  ovs load: 947 | 15 threads at >40
  vxlans: 272
  qr ifaces: 309
  qg ifaces: 277
  ha ifaces: 299
  arp dp flows: 89% | 7406 / 8254

/cases/02056811/sosreport-20180316-070825/esclos042073.eecloud.nsn-net.net
  hostname: esclos042073.eecloud.nsn-net.net
  date: Fri Mar 16 07:08:25 EET 2018
  ovs load: 733 | 15 threads at >40
  vxlans: 272
  qr ifaces: 310
  qg ifaces: 279
  ha ifaces: 300
  arp dp flows: 92% | 15002 / 16274

/cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net
  hostname: esclos042001.eecloud.nsn-net.net
  date: Fri Mar 16 12:52:53 EET 2018
  ovs load: 893 | 16 threads at >40
  vxlans: 252
  qr ifaces: 306
  qg ifaces: 274
  ha ifaces: 295
  arp dp flows: 95% | 13697 / 14328

/cases/02056811/sosreport-20180329-121944
  hostname: esclos999001.eecloud.nsn-net.net
  date: Thu Mar 29 12:22:59 EEST 2018
  ovs load: 23.2 | 0 threads at >40
  vxlans: 19
  qr ifaces: 229
  qg ifaces: 230
  ha ifaces: 285
  arp dp flows: 6% | 26 / 372
Another interesting piece of data is ovs-appctl upcall/show:

/cases/02056811/sosreport-20180316-070825/esclos042037.eecloud.nsn-net.net
system@ovs-system:
  flows         : (current 9935) (avg 9857) (max 17348) (limit 200000)
  dump duration : 87ms
  ufid enabled : true

  435: (keys 693)
  436: (keys 683)
  437: (keys 697)
  438: (keys 673)
  439: (keys 652)
  440: (keys 664)
  441: (keys 670)
  442: (keys 672)
  443: (keys 673)
  444: (keys 687)
  445: (keys 670)
  446: (keys 668)
  447: (keys 638)
  448: (keys 663)
  449: (keys 682)

/cases/02056811/sosreport-20180316-070825/esclos042073.eecloud.nsn-net.net
system@ovs-system:
  flows         : (current 16301) (avg 16330) (max 17278) (limit 200000)
  dump duration : 81ms
  ufid enabled : true

  435: (keys 1090)
  436: (keys 1119)
  437: (keys 1121)
  438: (keys 1089)
  439: (keys 1063)
  440: (keys 1101)
  441: (keys 1067)
  442: (keys 1085)
  443: (keys 1070)
  444: (keys 1136)
  445: (keys 1103)
  446: (keys 1041)
  447: (keys 1047)
  448: (keys 1146)
  449: (keys 1053)

/cases/02056811/sosreport-20180316-125253/esclos042001.eecloud.nsn-net.net
system@ovs-system:
  flows         : (current 14419) (avg 14418) (max 14790) (limit 200000)
  dump duration : 49ms
  ufid enabled : true

  435: (keys 934)
  436: (keys 978)
  437: (keys 967)
  438: (keys 985)
  439: (keys 934)
  440: (keys 980)
  441: (keys 954)
  442: (keys 948)
  443: (keys 927)
  444: (keys 970)
  445: (keys 980)
  446: (keys 962)
  447: (keys 1000)
  448: (keys 936)
  449: (keys 966)

/cases/02056811/sosreport-20180329-121944
system@ovs-system:
  flows         : (current 366) (avg 369) (max 1533) (limit 5000)
  dump duration : 7ms
  ufid enabled : true

  181: (keys 38)
  182: (keys 33)
  183: (keys 21)
  184: (keys 28)
  185: (keys 34)
  186: (keys 25)
  187: (keys 23)
  188: (keys 34)
  189: (keys 21)
  190: (keys 24)
  191: (keys 23)
  192: (keys 33)
  193: (keys 32)
What I think we're seeing is something (what that 'something' is remains unclear) causing the dp flows to be continuously revalidated. The revalidator threads then not only do a lot of work, as there are many complex flows (refer to the long flows in comment 46), but xlating these flows on different CPUs causes contention on the trunk bitmaps (cache ping pong; refer to comment 26 where it is clearly visible).

Interestingly, according to Miguel, those trunk ports that cause the costly cache ping pong should not even be there: they were not tagged while they should have been.
Created attachment 1427741 [details]
Ovs scripts to simulate the environment

I've been trying to reproduce the high CPU usage with no success. Attached are the scripts I've been using. "setup" creates a bridge with 500 ports (note that it takes ages, see also below), arpsend.py bombs the bridge with random ARPs. Alternatively, arpsend2.py sends 10000 ARPs repeatedly. The CPUs are only mildly loaded (< 6% for the revalidator threads).

To force frequent revalidation, I tried to change the bridge config in a tight loop (setting a port as an access port and changing its tag, adding/removing a port) and noticed that it takes many seconds for ovs-vswitchd to process each request. This of course could not have any effect on the revalidator threads. But it's interesting. Even commands such as "ovs-appctl upcall/show" sometimes take many seconds to complete. From strace, ovs-appctl is waiting for a reply on a unix socket from ovs-vswitchd. These stalls don't happen every time but are happening even when there's no traffic flowing.
I managed to reproduce and analyze this.

First, the reproduction steps. It's actually surprisingly simple once you explore all the blind alleys.

Create an ovs bridge:

------
ovs-vsctl add-br ovs0
ip l s ovs0 up
------

Save this to a file named "reproducer.py":

------
#!/usr/bin/python
from scapy.all import *

data = [(str(RandMAC()), str(RandIP())) for i in range(int(sys.argv[1]))]
s = conf.L2socket(iface="ovs0")

while True:
    for mac, ip in data:
        p = Ether(src=mac, dst=mac)/IP(src=ip, dst=ip)
        s.send(p)
------

Run the reproducer:

./reproducer.py 5000
The problem is how flow revalidation works in ovs. There are several 'revalidator' threads launched. They should normally sleep (modulo waking every 0.5 second just to do nothing) and they wake up if anything of interest happens (udpif_revalidator => poll_block). On every wake up, each revalidator thread checks whether flow revalidation is needed and, if it is, does the revalidation. The revalidation is very costly with a high number of flows. I also suspect there's a lot of contention between the revalidator threads.

The flow revalidation is triggered by many things. What is of interest for us is that any eviction of a MAC learning table entry triggers revalidation. The reproducer script repeatedly sends the same 5000 packets, all of them with a different MAC address. This causes constant overflows of the MAC learning table and constant revalidation. The revalidator threads are immediately woken up and are busy looping on the revalidation.

Which is exactly the pattern from the customers' data: there are 16000+ flows and the packet capture shows that the packets are repeating every second.

A quick fix is to increase the MAC learning table size:

ovs-vsctl set bridge <bridge> other-config:mac-table-size=50000

This should lower the CPU usage substantially; allow a few seconds for things to settle down.
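The overflow mechanism can be illustrated with a toy LRU model. This is a simplified sketch, not the actual ovs mac-learning code: once the number of unique source MACs exceeds the table size (2048 by default), every pass over the repeating traffic evicts an entry on every learn, and in ovs each such eviction would trigger a revalidation:

```python
from collections import OrderedDict

class ToyMacTable:
    """Toy LRU MAC learning table that counts evictions (simplified model)."""
    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.evictions = 0

    def learn(self, mac):
        if mac in self.entries:
            self.entries.move_to_end(mac)   # refresh existing entry, no eviction
            return
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # evict the oldest entry
            self.evictions += 1
        self.entries[mac] = True

# 5000 unique MACs cycling through a 2048-entry table, like the reproducer:
table = ToyMacTable(2048)
macs = ["mac%d" % i for i in range(5000)]
for _ in range(3):        # three passes of the repeating traffic
    for mac in macs:
        table.learn(mac)
print(table.evictions)    # -> 12952: once the table is warm, every learn evicts
```

With 5000 MACs and 2048 slots, the most-recently-seen window never contains the next MAC in the cycle, so after the first fill every single packet causes an eviction; that is the "constant overflow" pattern described above.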
Now, what ovs does with the revalidation is not really smart. There's no reason for the costly revalidation just because the MAC learning table overflowed; we should be able to be much more targeted. Also, an analysis of contention between the revalidator threads would be a useful thing to do. For now, I suggest deploying the workaround with the increased mac-table-size, and we should look into optimizing the revalidation long term.
(In reply to Jiri Benc from comment #69)
> A quick fix is to increase the MAC learning table size:
>
> ovs-vsctl set bridge <bridge> other-config:mac-table-size=50000
>
> This should lower the CPU usage down substantially; allow a few seconds for
> things to settle down.

I think this is something we should implement in both the openstack neutron-openvswitch-agent and ovn-controller. Although ovn-controller doesn't use NORMAL rules (and no huge number of flows will be created in the datapath, I guess), I understand it should be safer to handle a higher number. Probably a setting for both, with a high default, may be the best option.
(In reply to Jiri Benc from comment #69)
> The reproducer script repeatedly sends the same 5000 packets, all of them
> with a different MAC address. This causes constant overflows of the MAC
> learning table and constant revalidation. The revalidator threads are being
> immediately woken up and are busy looping the revalidation.

It was odd that the reproducer needed roughly 5000 packets to trigger the issue, vs the 2048+ you would expect. The issue is with the reproducer script: the random MAC function also generates multicast MAC addresses, which are not accepted :) About half of the packets generated were multicast... So if you change the generation of the MACs as follows:

data = [(str("00" + str(RandMAC())[2:]), str(RandIP())) for i in range(int(sys.argv[1]))]

you will see the issue with 2100 packets.

Next step will be to add some instrumentation: counters that make it easy to identify issues like this. After that I will try to see how/if revalidation can be optimized further for MAC table changes.
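The multicast issue found above can be checked directly: a MAC address is multicast when the least-significant bit of its first octet (the I/G bit) is set, which is why roughly half of RandMAC()'s output is not learned as a source address. A small sketch with a hypothetical helper:

```python
def is_multicast_mac(mac):
    """True if the I/G bit (LSB of the first octet) of a colon-separated MAC is set."""
    return bool(int(mac.split(":")[0], 16) & 0x01)

print(is_multicast_mac("01:00:5e:00:00:fb"))  # True  (multicast)
print(is_multicast_mac("00:1a:2b:3c:4d:5e"))  # False (unicast)
# Forcing the first octet to "00", as in the fixed reproducer line above,
# guarantees unicast source MACs, so every generated MAC counts toward
# overflowing the learning table.
```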
OSP11 is now retired, see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828
Re-opening this BZ as we are still working on this, and it is a problem in all OSP versions. Bumped the reported version to 12 for now.
Sent patch upstream to add additional troubleshooting counters: https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348546.html
Sent patch upstream to see if people would accept an increase in the default mac table size: https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348683.html
After some more research, I'll close this BZ as "won't fix", as the root cause for this issue was an FDB table configured too small. A set of separate BZs has been created to make the FDB configuration available in OSP.

Changing the default FDB size to a 50K+ value is not a realistic option. First, because it's already configurable (and this is a deployment-specific adjustment), and second, it might open up OVS to memory depletion attacks. We have submitted an upstream patch to make troubleshooting these kinds of rapid FDB replacements easier. See comment #87.

I also looked at making the revalidation process faster for this specific issue. However, it's not a straightforward change, and on top of this, it will not result in less CPU usage. If we make the process "faster", it will just result in more FDB updates per second, not in less CPU usage in the revalidator threads.
@eelco @jbenc, everybody involved, great job. Thank you.