Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 1695725

Summary: ovn-controller is using 100% cpu on OVN deployment, delay on security groups being applied
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Timothy Redaelli <tredaelli>
Component: openvswitch Assignee: Timothy Redaelli <tredaelli>
openvswitch sub component: daemons and tools QA Contact: haidong li <haili>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: atragler, ctrautma, jhsiao, pmorey, pvauter, qding, yinxu
Version: FDP 19.B   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openvswitch-2.9.0-103.el7fdn Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-29 09:26:47 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Timothy Redaelli 2019-04-03 16:58:53 UTC
This bug was initially created as a copy of Bug #1695722

I am copying this bug because: 



This bug was initially created as a copy of Bug #1677616

I am copying this bug because: 



Description of problem:

Customer is facing delay on the security groups being applied 

"""
There is a delay in applying security groups on instances created by a heat template. Thus there is no connection between instances. After some time (about 30 min) it starts working.
Investigation revealed high CPU usage by the ovn-controller.
"""

I found there are a lot of flows on br-int (Feb 12):
$ wc -l ovs-ofctl_-O_OpenFlow13_dump-flows_br-int 

329665 ovs-ofctl_-O_OpenFlow13_dump-flows_br-int

This can be found on the journal:

Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519076|timeval|WARN|Unreasonably long 5381ms poll interval (4889ms user, 492ms system)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519077|timeval|WARN|faults: 571166 minor, 0 major
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519078|timeval|WARN|context switches: 0 voluntary, 7 involuntary
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519079|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519080|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519081|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519082|pinctrl|INFO|DHCPACK fa:16:3e:89:44:2a 10.11.154.160
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519083|timeval|WARN|Unreasonably long 5252ms poll interval (4903ms user, 346ms system)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519084|timeval|WARN|faults: 403610 minor, 0 major
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519085|timeval|WARN|context switches: 0 voluntary, 9 involuntary
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519086|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519087|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519088|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 11 16:17:59 cnifthc02cmp004 dhclient[742515]: DHCPDISCOVER on enp7s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x1fa5a1b6)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519089|timeval|WARN|Unreasonably long 5054ms poll interval (4739ms user, 314ms system)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519090|timeval|WARN|faults: 313461 minor, 0 major
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519091|timeval|WARN|context switches: 0 voluntary, 4 involuntary
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519092|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519093|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
..
..
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561264|timeval|WARN|Unreasonably long 4598ms poll interval (4585ms user, 13ms system)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561265|timeval|WARN|context switches: 0 voluntary, 1 involuntary
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561266|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561267|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:58120<->192.168.105.251:6642) at lib/stream-fd.c:157 (100% CPU usage)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561268|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561269|pinctrl|INFO|DHCPACK fa:16:3e:40:d5:92 10.11.151.184
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561270|timeval|WARN|Unreasonably long 4713ms poll interval (4713ms user, 0ms system)
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561271|timeval|WARN|faults: 52678 minor, 0 major
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561272|timeval|WARN|context switches: 0 voluntary, 3 involuntary
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561273|pinctrl|INFO|Dropped 1 log messages in last 4 seconds (most recently, 4 seconds ago) due to excessive rate
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561274|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561275|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32


These are numbers from today, when the customer faced the issue again:

[root@cnifthc02ctl001 heat-admin]# docker exec -itu root bde0cda6a714 /bin/bash
()[root@ctl001 /]# ovn-nbctl list ACL |grep uuid|wc -l
5051
()[root@ctl001 /]# ovn-nbctl list Logical_Switch | grep uuid | wc -l
62
()[root@ctl001 /]# ovn-nbctl list Logical_Router | grep uuid | wc -l
20
()[root@ctl001 /]# ovn-nbctl list Logical_Switch_Port | grep uuid | wc -l
501

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
This is the second time it has happened.

Steps to Reproduce:
1. unsure
2.
3.

Actual results:
High CPU usage by ovn-controller.
Large delay before security groups are applied.

Expected results:
Normal ovn-controller CPU usage; security groups applied promptly.

Additional info:

Comment 2 haidong li 2019-04-12 05:36:42 UTC
reproduced in the old version:

run the script:

from scapy.all import *

sendp(Ether(src="00:de:ad:01:00:01", dst="ff:ff:ff:ff:ff:ff") /
      IP(src="0.0.0.0", dst="255.255.255.255") /
      UDP(sport=68, dport=67) /
      BOOTP(chaddr="00:de:ad:01:00:01", ciaddr="0.0.0.0") /
      DHCP(options=[("message-type", "discover")]) /
      DHCP(options=[("subnet_mask", "255.255.255.0")]) /
      DHCP(options=[("router", "0.0.0.0")]) /
      DHCP(options=[("name_server", "0.0.0.0"), "end"]),
      iface="eth1")
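The scapy reproducer above broadcasts a DHCP DISCOVER. For reference, a minimal standard-library sketch of the same BOOTP/DHCP payload (the UDP body only, not the Ethernet/IP headers), assuming the standard RFC 2131 field layout; the function name and flag values are illustrative, not part of the original reproducer:

```python
import struct

def build_dhcp_discover(mac="00:de:ad:01:00:01"):
    """Build a BOOTP/DHCP DISCOVER payload (UDP body only), per RFC 2131."""
    chaddr = bytes.fromhex(mac.replace(":", "")) + b"\x00" * 10  # pad to 16 bytes
    bootp = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,              # op: BOOTREQUEST
        1,              # htype: Ethernet
        6,              # hlen: MAC length
        0,              # hops
        0,              # xid (0 here; real clients randomize)
        0,              # secs
        0,              # flags
        b"\x00" * 4,    # ciaddr 0.0.0.0
        b"\x00" * 4,    # yiaddr
        b"\x00" * 4,    # siaddr
        b"\x00" * 4,    # giaddr
        chaddr,         # client hardware address
        b"\x00" * 64,   # sname
        b"\x00" * 128,  # file
    )
    options = (
        b"\x63\x82\x53\x63"          # DHCP magic cookie
        b"\x35\x01\x01"              # option 53: message type = DISCOVER
        b"\x01\x04\xff\xff\xff\x00"  # option 1: subnet mask 255.255.255.0
        b"\x03\x04\x00\x00\x00\x00"  # option 3: router 0.0.0.0
        b"\x06\x04\x00\x00\x00\x00"  # option 6: name server 0.0.0.0
        b"\xff"                      # end option
    )
    return bootp + options
```

The resulting bytes would still need to be wrapped in UDP 68->67 / IP broadcast / Ethernet broadcast headers to be sent on the wire, which is what the scapy layers above do.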

[root@hp-dl380pg8-04 opt]# 
[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.9.0-101.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch-ovn-common-2.9.0-101.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-101.el7fdp.x86_64
openvswitch-2.9.0-101.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch

top - 23:20:57 up 3 days, 2 min,  4 users,  load average: 0.43, 0.26, 0.20
Tasks: 313 total,   3 running, 310 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.5 us,  0.2 sy,  0.0 ni, 94.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26465500 free,  1545444 used,  4791372 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30479248 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
55952 root      10 -10   73684  22980   1372 R 100.0  0.1   4:32.46 ovn-contro+ 
56363 qemu      20   0 2926928 774936  10232 S  44.5  2.4   1:21.13 qemu-kvm    
55883 openvsw+  10 -10 1921136 225832  13600 S   5.3  0.7   0:15.96 ovs-vswitc+ 
15111 root      20   0  151572   6748   4340 S   1.0  0.0   0:05.10 sshd        
55081 root      20   0 1546968  24168  14524 S   1.0  0.1   0:01.79 libvirtd    
 5750 root      20   0  347080   6876   5164 S   0.7  0.0   0:00.39 virsh       
56551 root      20   0       0      0      0 S   0.7  0.0   0:00.18 vhost-56363 
   10 root      20   0       0      0      0 S   0.3  0.0   3:52.26 rcu_sched   
 5869 root      20   0  162144   2460   1592 R   0.3  0.0   0:00.05 top         
17379 root      20   0  169616  26568   5424 S   0.3  0.1   0:27.75 beah-rhts-+ 
    1 root      20   0  195252   8424   4220 S   0.0  0.0   0:32.70 systemd     
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd    
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.39 ksoftirqd/0 
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+ 
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.94 migration/0 
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh      
   11 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+ 

[root@hp-dl380pg8-04 opt]# cat /var/log/openvswitch/ovn-controller.log | grep CPU
2019-04-12T03:20:14.019Z|00155|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (75% CPU usage)
2019-04-12T03:20:18.479Z|00161|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (103% CPU usage)
2019-04-12T03:20:24.500Z|00167|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (100% CPU usage)
2019-04-12T03:20:50.212Z|00180|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.426Z|00181|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.639Z|00182|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.851Z|00183|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:54.463Z|00189|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:21:00.523Z|00195|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (101% CPU usage)
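The grep above pulls out the poll_loop wakeup lines; a small sketch for extracting the timestamp and CPU percentage from that log format (the function name and threshold are illustrative, based only on the line format shown above):

```python
import re

# Matches the "(NNN% CPU usage)" suffix that poll_loop appends
# to its wakeup messages, as seen in ovn-controller.log above.
CPU_RE = re.compile(r"\((\d+)% CPU usage\)")

def high_cpu_lines(log_lines, threshold=90):
    """Return (timestamp, cpu_percent) pairs for lines at or above threshold."""
    hits = []
    for line in log_lines:
        m = CPU_RE.search(line)
        if m and int(m.group(1)) >= threshold:
            # The log line is pipe-delimited; the first field is the timestamp.
            hits.append((line.split("|", 1)[0], int(m.group(1))))
    return hits
```

Fed the log excerpt above, this would keep only the 100%+ wakeups and drop the 75% one, which makes sustained busy-looping easy to spot.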

=====================================================================================================================================
verified on the latest version:

[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.9.0-104.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch-ovn-common-2.9.0-104.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-104.el7fdp.x86_64
openvswitch-2.9.0-104.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch

top - 01:30:08 up 3 days,  2:11,  4 users,  load average: 0.33, 0.09, 0.13
Tasks: 313 total,   1 running, 312 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.2 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26291976 free,  1646568 used,  4863772 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30367344 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
 8175 qemu      20   0 3081616 879096  10236 S  15.3  2.7   1:43.01 qemu-kvm    
 7695 openvsw+  10 -10 1921136 226484  13600 S   2.0  0.7   0:41.80 ovs-vswitc+ 
 3751 root      20   0       0      0      0 S   0.3  0.0   0:17.86 xfsaild/dm+ 
 6725 root      20   0 1546968  24152  14524 S   0.3  0.1   0:01.65 libvirtd    
 7645 openvsw+  10 -10   55016   2764   1608 S   0.3  0.0   0:00.56 ovsdb-serv+ 
15111 root      20   0  151572   6748   4340 S   0.3  0.0   0:05.65 sshd        
20038 root      20   0  347080   6876   5164 S   1.0  0.0   0:00.30 virsh       
   10 root      20   0       0      0      0 S   0.7  0.0   4:05.17 rcu_sched   
  143 root      20   0       0      0      0 S   0.3  0.0   0:02.72 kworker/7:1 
 7733 root      20   0   60564  10624   1716 S   0.3  0.0   0:02.41 ovsdb-serv+ 
    1 root      20   0  195252   8424   4220 S   0.0  0.0   0:36.97 systemd     
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd    
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.42 ksoftirqd/0 
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+ 
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.97 migration/0 
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh      
   11 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+

Comment 4 errata-xmlrpc 2019-04-29 09:26:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0898