Bug 1695722 - ovn-controller is using 100% cpu on OVN deployment, delay on security groups being applied
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.11
Version: FDP 19.B
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Timothy Redaelli
QA Contact: haidong li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-03 16:55 UTC by Timothy Redaelli
Modified: 2020-01-19 06:45 UTC
CC: 3 users

Fixed In Version: ovn2.11-2.11.0-7.el7fdn
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-30 09:23:26 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHEA-2019:0907 (last updated 2019-04-30 09:23:27 UTC)

Description Timothy Redaelli 2019-04-03 16:55:38 UTC
This bug was initially created as a copy of Bug #1677616

I am copying this bug because: 



Description of problem:

The customer is facing a delay before security groups are applied:

"""
There is a delay in applying security groups on instances created by a heat template. Thus there is no connection between instances. After some time (about 30 min) it starts working.
Investigation revealed high CPU usage by the ovn-controller.
"""

I found that there are a lot of flows on br-int (Feb 12):
$ wc -l ovs-ofctl_-O_OpenFlow13_dump-flows_br-int 

329665 ovs-ofctl_-O_OpenFlow13_dump-flows_br-int
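
As a rough triage step, grouping the dump by table shows where the flows concentrate, e.g.:

$ ovs-ofctl -O OpenFlow13 dump-flows br-int | grep -o 'table=[0-9]*' | sort -t= -k2 -n | uniq -c | sort -rn | head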

These warnings can be found in the journal:

Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519076|timeval|WARN|Unreasonably long 5381ms poll interval (4889ms user, 492ms system)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519077|timeval|WARN|faults: 571166 minor, 0 major
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519078|timeval|WARN|context switches: 0 voluntary, 7 involuntary
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519079|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519080|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519081|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519082|pinctrl|INFO|DHCPACK fa:16:3e:89:44:2a 10.11.154.160
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519083|timeval|WARN|Unreasonably long 5252ms poll interval (4903ms user, 346ms system)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519084|timeval|WARN|faults: 403610 minor, 0 major
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519085|timeval|WARN|context switches: 0 voluntary, 9 involuntary
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519086|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519087|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519088|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 11 16:17:59 cnifthc02cmp004 dhclient[742515]: DHCPDISCOVER on enp7s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x1fa5a1b6)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519089|timeval|WARN|Unreasonably long 5054ms poll interval (4739ms user, 314ms system)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519090|timeval|WARN|faults: 313461 minor, 0 major
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519091|timeval|WARN|context switches: 0 voluntary, 4 involuntary
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519092|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519093|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
..
..
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561264|timeval|WARN|Unreasonably long 4598ms poll interval (4585ms user, 13ms system)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561265|timeval|WARN|context switches: 0 voluntary, 1 involuntary
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561266|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561267|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:58120<->192.168.105.251:6642) at lib/stream-fd.c:157 (100% CPU usage)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561268|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561269|pinctrl|INFO|DHCPACK fa:16:3e:40:d5:92 10.11.151.184
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561270|timeval|WARN|Unreasonably long 4713ms poll interval (4713ms user, 0ms system)
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561271|timeval|WARN|faults: 52678 minor, 0 major
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561272|timeval|WARN|context switches: 0 voluntary, 3 involuntary
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561273|pinctrl|INFO|Dropped 1 log messages in last 4 seconds (most recently, 4 seconds ago) due to excessive rate
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561274|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561275|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
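
Each "Unreasonably long ... poll interval" warning means ovn-controller spent that whole interval busy recomputing rather than sleeping in poll(). A rough way to gauge the frequency and current load on an affected node, e.g.:

# grep -c 'Unreasonably long' /var/log/openvswitch/ovn-controller.log
# top -b -n 1 -p "$(pidof ovn-controller)"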


These are the numbers from today, when the customer faced the issue again:

[root@cnifthc02ctl001 heat-admin]# docker exec -itu root bde0cda6a714 /bin/bash
()[root@ctl001 /]# ovn-nbctl list ACL |grep uuid|wc -l
5051
()[root@ctl001 /]# ovn-nbctl list Logical_Switch | grep uuid | wc -l
62
()[root@ctl001 /]# ovn-nbctl list Logical_Router | grep uuid | wc -l
20
()[root@ctl001 /]# ovn-nbctl list Logical_Switch_Port | grep uuid | wc -l
501
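
For scale: each ACL on a logical switch is expanded into logical flows in the southbound database, and ovn-controller translates every logical flow on each local datapath into one or more OpenFlow flows, so ~5000 ACLs across 62 switches can plausibly account for the ~330k flows dumped above. The expansion can be sized with the same list/grep/wc pattern, e.g. (assuming ovn-sbctl can reach the southbound DB from the same container):

()[root@ctl001 /]# ovn-sbctl list Logical_Flow | grep uuid | wc -l
()[root@ctl001 /]# ovn-sbctl lflow-list | grep -c ls_in_acl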

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
This is the second time it has happened.

Steps to Reproduce:
1. Unsure

Actual results:
High ovn-controller CPU usage.
Huge delay before security groups are applied.

Expected results:
Normal ovn-controller CPU usage, with security groups applied promptly.

Additional info:

Comment 2 haidong li 2019-04-11 10:20:36 UTC
Reproduced on the old version:

[root@hp-dl380pg8-04 opt]# uname -a
Linux hp-dl380pg8-04.rhts.eng.pek2.redhat.com 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch2.11-2.11.0-4.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 opt]# rpm -qa | grep ovn
ovn2.11-host-2.11.0-4.el7fdp.x86_64
ovn2.11-common-2.11.0-4.el7fdp.x86_64
ovn2.11-central-2.11.0-4.el7fdp.x86_64
ovn2.11-2.11.0-4.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 opt]# 

Add a lot of logical ports and execute the following script:
from scapy.all import *

# Flood 1000 DHCP discover broadcasts out eth1; each one is punted to
# ovn-controller's DHCP responder (pinctrl), loading its main loop.
for x in range(1000):
    sendp(Ether(src="00:de:ad:01:00:01", dst="ff:ff:ff:ff:ff:ff") /
          IP(src="0.0.0.0", dst="255.255.255.255") /
          UDP(sport=68, dport=67) /
          BOOTP(chaddr="00:de:ad:01:00:01", ciaddr="0.0.0.0") /
          DHCP(options=[("message-type", "discover"),
                        ("subnet_mask", "255.255.255.0"),
                        ("router", "0.0.0.0"),
                        ("name_server", "0.0.0.0"),
                        "end"]),
          iface="eth1")

[root@hp-dl380pg8-04 ovn]# top

top - 02:22:28 up 2 days,  3:03,  4 users,  load average: 0.80, 0.27, 0.23
Tasks: 314 total,   2 running, 312 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.1 us,  0.3 sy,  0.0 ni, 93.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 25893460 free,  1955284 used,  4953572 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30065744 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                
59606 root      10 -10  223824  19980   1676 R 100.3  0.1  82:17.21 ovn-controller                                         
60028 qemu      20   0 2932040   1.1g  10236 S  45.8  3.6  21:10.92 qemu-kvm                                               
59533 openvsw+  10 -10 1938872 241900  17392 S   6.0  0.7  14:44.17 ovs-vswitchd                                           
56162 root      20   0  347080   7036   5188 S   1.0  0.0   0:06.11 virsh                                                  
59079 root      20   0 1538776  26328  14312 S   1.0  0.1   0:08.69 libvirtd                                               
59943 root      20   0  150432   5684   4340 S   1.0  0.0   0:11.24 sshd                                                   
60230 root      20   0       0      0      0 S   0.7  0.0   0:10.58 vhost-60028                                            
   10 root      20   0       0      0      0 S   0.3  0.0   2:24.81 rcu_sched   
[root@hp-dl380pg8-04 opt]# cat /var/log/openvswitch/ovn-controller.log | grep CPU
2019-04-11T06:22:14.381Z|01075|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:14.564Z|01076|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:14.745Z|01077|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:14.926Z|01078|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.108Z|01079|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.288Z|01080|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.469Z|01081|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.651Z|01082|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.831Z|01085|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:16.011Z|01086|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:20.528Z|01090|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (102% CPU usage)
2019-04-11T06:22:26.505Z|01096|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (102% CPU usage)

===================================================================================================================================

Verified on the latest version:

[root@hp-dl380pg8-04 scapy]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch2.11-2.11.0-5.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 scapy]# rpm -qa | grep ovn
ovn2.11-central-2.11.0-7.el7fdp.x86_64
ovn2.11-2.11.0-7.el7fdp.x86_64
ovn2.11-host-2.11.0-7.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 scapy]# 

[root@hp-dl380pg8-04 scapy]# top
top - 06:15:23 up 2 days,  6:56,  4 users,  load average: 0.09, 0.38, 0.52
Tasks: 314 total,   1 running, 313 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.5 us,  0.4 sy,  0.0 ni, 98.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26416436 free,  1588120 used,  4797760 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30434164 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
60896 qemu      20   0 2929992 810684  10236 S  43.9  2.5   1:30.35 qemu-kvm    
60409 openvsw+  10 -10 1939560 234360  17392 S   7.0  0.7   0:25.01 ovs-vswitc+ 
59570 root      20   0 1547032  24220  14524 S   1.0  0.1   0:01.90 libvirtd    
56012 root      20   0  347080   6876   5164 S   0.7  0.0   0:00.54 virsh       
59943 root      20   0  150432   5684   4340 S   0.7  0.0   0:11.81 sshd        
61121 root      20   0       0      0      0 S   0.7  0.0   0:00.35 vhost-60896 
   10 root      20   0       0      0      0 S   0.3  0.0   2:32.25 rcu_sched   
  115 root      rt   0       0      0      0 S   0.3  0.0   0:01.14 migration/+ 
  145 root      20   0       0      0      0 S   0.3  0.0   0:01.99 kworker/11+ 
 3751 root      20   0       0      0      0 S   0.3  0.0   0:15.79 xfsaild/dm+ 
 6782 root      20   0       0      0      0 S   0.3  0.0   0:30.08 xfsaild/dm+ 
50665 root      20   0  162144   2456   1592 R   0.3  0.0   0:00.10 top

Comment 4 errata-xmlrpc 2019-04-30 09:23:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0907

