Bug 1695722 - ovn-controller is using 100% cpu on OVN deployment, delay on security groups being applied
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.11
Version: FDP 19.B
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Timothy Redaelli
QA Contact: haidong li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-03 16:55 UTC by Timothy Redaelli
Modified: 2020-01-19 06:45 UTC
CC: 3 users

Fixed In Version: ovn2.11-2.11.0-7.el7fdn
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-30 09:23:26 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHEA-2019:0907 (last updated 2019-04-30 09:23:27 UTC)

Description Timothy Redaelli 2019-04-03 16:55:38 UTC
This bug was initially created as a copy of Bug #1677616

I am copying this bug because: 



Description of problem:

The customer is facing a delay before security groups are applied:

"""
There is a delay in applying security groups on instances created by a heat template. Thus there is no connection between instances. After some time (about 30 min) it starts working.
Investigation revealed high CPU usage by the ovn-controller.
"""

I found that there are a lot of flows on br-int (Feb 12):
$ wc -l ovs-ofctl_-O_OpenFlow13_dump-flows_br-int 

329665 ovs-ofctl_-O_OpenFlow13_dump-flows_br-int
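
As a rough triage step, grouping the dump by table shows where the flows concentrate, e.g.:

$ ovs-ofctl -O OpenFlow13 dump-flows br-int | grep -o 'table=[0-9]*' | sort -t= -k2 -n | uniq -c | sort -rn | head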

These warnings can be found in the journal:

Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519076|timeval|WARN|Unreasonably long 5381ms poll interval (4889ms user, 492ms system)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519077|timeval|WARN|faults: 571166 minor, 0 major
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519078|timeval|WARN|context switches: 0 voluntary, 7 involuntary
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519079|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519080|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519081|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519082|pinctrl|INFO|DHCPACK fa:16:3e:89:44:2a 10.11.154.160
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519083|timeval|WARN|Unreasonably long 5252ms poll interval (4903ms user, 346ms system)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519084|timeval|WARN|faults: 403610 minor, 0 major
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519085|timeval|WARN|context switches: 0 voluntary, 9 involuntary
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519086|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519087|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519088|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 11 16:17:59 cnifthc02cmp004 dhclient[742515]: DHCPDISCOVER on enp7s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x1fa5a1b6)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519089|timeval|WARN|Unreasonably long 5054ms poll interval (4739ms user, 314ms system)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519090|timeval|WARN|faults: 313461 minor, 0 major
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519091|timeval|WARN|context switches: 0 voluntary, 4 involuntary
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519092|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519093|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
..
..
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561264|timeval|WARN|Unreasonably long 4598ms poll interval (4585ms user, 13ms system)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561265|timeval|WARN|context switches: 0 voluntary, 1 involuntary
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561266|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561267|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:58120<->192.168.105.251:6642) at lib/stream-fd.c:157 (100% CPU usage)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561268|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561269|pinctrl|INFO|DHCPACK fa:16:3e:40:d5:92 10.11.151.184
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561270|timeval|WARN|Unreasonably long 4713ms poll interval (4713ms user, 0ms system)
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561271|timeval|WARN|faults: 52678 minor, 0 major
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561272|timeval|WARN|context switches: 0 voluntary, 3 involuntary
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561273|pinctrl|INFO|Dropped 1 log messages in last 4 seconds (most recently, 4 seconds ago) due to excessive rate
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561274|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561275|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
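
Each "Unreasonably long ... poll interval" warning means ovn-controller spent that whole interval busy recomputing rather than sleeping in poll(). A rough way to gauge the frequency and current load on an affected node, e.g.:

# grep -c 'Unreasonably long' /var/log/openvswitch/ovn-controller.log
# top -b -n 1 -p "$(pidof ovn-controller)"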


These are the numbers from today, when the customer faced the issue again:

[root@cnifthc02ctl001 heat-admin]# docker exec -itu root bde0cda6a714 /bin/bash
()[root@ctl001 /]# ovn-nbctl list ACL |grep uuid|wc -l
5051
()[root@ctl001 /]# ovn-nbctl list Logical_Switch | grep uuid | wc -l
62
()[root@ctl001 /]# ovn-nbctl list Logical_Router | grep uuid | wc -l
20
()[root@ctl001 /]# ovn-nbctl list Logical_Switch_Port | grep uuid | wc -l
501
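
For scale: each ACL on a logical switch is expanded into logical flows in the southbound database, and ovn-controller translates every logical flow on each local datapath into one or more OpenFlow flows, so ~5000 ACLs across 62 switches can plausibly account for the ~330k flows dumped above. The expansion can be sized with the same list/grep/wc pattern, e.g. (assuming ovn-sbctl can reach the southbound DB from the same container):

()[root@ctl001 /]# ovn-sbctl list Logical_Flow | grep uuid | wc -l
()[root@ctl001 /]# ovn-sbctl lflow-list | grep -c ls_in_acl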

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
This is the second time it has happened.

Steps to Reproduce:
1. Unsure

Actual results:
High ovn-controller CPU usage.
Huge delay before security groups are applied.

Expected results:
Normal ovn-controller CPU usage, with security groups applied promptly.

Additional info:

Comment 2 haidong li 2019-04-11 10:20:36 UTC
Reproduced on the old version:

[root@hp-dl380pg8-04 opt]# uname -a
Linux hp-dl380pg8-04.rhts.eng.pek2.redhat.com 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch2.11-2.11.0-4.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 opt]# rpm -qa | grep ovn
ovn2.11-host-2.11.0-4.el7fdp.x86_64
ovn2.11-common-2.11.0-4.el7fdp.x86_64
ovn2.11-central-2.11.0-4.el7fdp.x86_64
ovn2.11-2.11.0-4.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 opt]# 

Add a lot of logical ports and execute the following script:
from scapy.all import *

# Flood 1000 DHCP discover broadcasts out eth1; each one is punted to
# ovn-controller's DHCP responder (pinctrl), loading its main loop.
for x in range(1000):
    sendp(Ether(src="00:de:ad:01:00:01", dst="ff:ff:ff:ff:ff:ff") /
          IP(src="0.0.0.0", dst="255.255.255.255") /
          UDP(sport=68, dport=67) /
          BOOTP(chaddr="00:de:ad:01:00:01", ciaddr="0.0.0.0") /
          DHCP(options=[("message-type", "discover"),
                        ("subnet_mask", "255.255.255.0"),
                        ("router", "0.0.0.0"),
                        ("name_server", "0.0.0.0"),
                        "end"]),
          iface="eth1")

[root@hp-dl380pg8-04 ovn]# top

top - 02:22:28 up 2 days,  3:03,  4 users,  load average: 0.80, 0.27, 0.23
Tasks: 314 total,   2 running, 312 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.1 us,  0.3 sy,  0.0 ni, 93.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 25893460 free,  1955284 used,  4953572 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30065744 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                
59606 root      10 -10  223824  19980   1676 R 100.3  0.1  82:17.21 ovn-controller                                         
60028 qemu      20   0 2932040   1.1g  10236 S  45.8  3.6  21:10.92 qemu-kvm                                               
59533 openvsw+  10 -10 1938872 241900  17392 S   6.0  0.7  14:44.17 ovs-vswitchd                                           
56162 root      20   0  347080   7036   5188 S   1.0  0.0   0:06.11 virsh                                                  
59079 root      20   0 1538776  26328  14312 S   1.0  0.1   0:08.69 libvirtd                                               
59943 root      20   0  150432   5684   4340 S   1.0  0.0   0:11.24 sshd                                                   
60230 root      20   0       0      0      0 S   0.7  0.0   0:10.58 vhost-60028                                            
   10 root      20   0       0      0      0 S   0.3  0.0   2:24.81 rcu_sched   
[root@hp-dl380pg8-04 opt]# cat /var/log/openvswitch/ovn-controller.log | grep CPU
2019-04-11T06:22:14.381Z|01075|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:14.564Z|01076|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:14.745Z|01077|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:14.926Z|01078|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.108Z|01079|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.288Z|01080|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.469Z|01081|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.651Z|01082|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:15.831Z|01085|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:16.011Z|01086|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2019-04-11T06:22:20.528Z|01090|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (102% CPU usage)
2019-04-11T06:22:26.505Z|01096|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (102% CPU usage)

===================================================================================================================================

Verified on the latest version:

[root@hp-dl380pg8-04 scapy]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch2.11-2.11.0-5.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 scapy]# rpm -qa | grep ovn
ovn2.11-central-2.11.0-7.el7fdp.x86_64
ovn2.11-2.11.0-7.el7fdp.x86_64
ovn2.11-host-2.11.0-7.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch
[root@hp-dl380pg8-04 scapy]# 

[root@hp-dl380pg8-04 scapy]# top
top - 06:15:23 up 2 days,  6:56,  4 users,  load average: 0.09, 0.38, 0.52
Tasks: 314 total,   1 running, 313 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.5 us,  0.4 sy,  0.0 ni, 98.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26416436 free,  1588120 used,  4797760 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30434164 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
60896 qemu      20   0 2929992 810684  10236 S  43.9  2.5   1:30.35 qemu-kvm    
60409 openvsw+  10 -10 1939560 234360  17392 S   7.0  0.7   0:25.01 ovs-vswitc+ 
59570 root      20   0 1547032  24220  14524 S   1.0  0.1   0:01.90 libvirtd    
56012 root      20   0  347080   6876   5164 S   0.7  0.0   0:00.54 virsh       
59943 root      20   0  150432   5684   4340 S   0.7  0.0   0:11.81 sshd        
61121 root      20   0       0      0      0 S   0.7  0.0   0:00.35 vhost-60896 
   10 root      20   0       0      0      0 S   0.3  0.0   2:32.25 rcu_sched   
  115 root      rt   0       0      0      0 S   0.3  0.0   0:01.14 migration/+ 
  145 root      20   0       0      0      0 S   0.3  0.0   0:01.99 kworker/11+ 
 3751 root      20   0       0      0      0 S   0.3  0.0   0:15.79 xfsaild/dm+ 
 6782 root      20   0       0      0      0 S   0.3  0.0   0:30.08 xfsaild/dm+ 
50665 root      20   0  162144   2456   1592 R   0.3  0.0   0:00.10 top

Comment 4 errata-xmlrpc 2019-04-30 09:23:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0907

