Bug 1695725

Summary: ovn-controller is using 100% cpu on OVN deployment, delay on security groups being applied
Product: Red Hat Enterprise Linux Fast Datapath
Component: openvswitch
Sub component: daemons and tools
Reporter: Timothy Redaelli <tredaelli>
Assignee: Timothy Redaelli <tredaelli>
QA Contact: haidong li <haili>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: atragler, ctrautma, jhsiao, pmorey, pvauter, qding, yinxu
Version: FDP 19.B
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openvswitch-2.9.0-103.el7fdn
Last Closed: 2019-04-29 09:26:47 UTC

Description Timothy Redaelli 2019-04-03 16:58:53 UTC
This bug was initially created as a copy of Bug #1695722

I am copying this bug because: 



This bug was initially created as a copy of Bug #1677616

I am copying this bug because: 



Description of problem:

The customer is facing a delay in security groups being applied.

"""
There is a delay in applying security groups on instances created by a heat template. Thus there is no connection between instances. After some time (about 30 min) it starts working.
Investigation revealed high CPU usage by the ovn-controller.
"""

I found that there are a lot of flows on br-int (Feb 12):
$ wc -l ovs-ofctl_-O_OpenFlow13_dump-flows_br-int 

329665 ovs-ofctl_-O_OpenFlow13_dump-flows_br-int
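
To see whether the flow table keeps growing, a rough helper like the one below could sample the count periodically (a sketch added for illustration, not part of the customer data; it assumes ovs-ofctl is available on the node and the bridge is br-int):

# Hypothetical monitoring helper: sample the br-int flow count once a minute.
import subprocess
import time

def count_flows(bridge="br-int"):
    out = subprocess.check_output(
        ["ovs-ofctl", "-O", "OpenFlow13", "dump-flows", bridge])
    # One flow per line; the first line of output is the OFPST_FLOW header.
    return len(out.splitlines()) - 1

while True:
    print(time.strftime("%F %T"), count_flows())
    time.sleep(60)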

The following can be found in the journal:

Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519076|timeval|WARN|Unreasonably long 5381ms poll interval (4889ms user, 492ms system)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519077|timeval|WARN|faults: 571166 minor, 0 major
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519078|timeval|WARN|context switches: 0 voluntary, 7 involuntary
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519079|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519080|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519081|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519082|pinctrl|INFO|DHCPACK fa:16:3e:89:44:2a 10.11.154.160
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519083|timeval|WARN|Unreasonably long 5252ms poll interval (4903ms user, 346ms system)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519084|timeval|WARN|faults: 403610 minor, 0 major
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519085|timeval|WARN|context switches: 0 voluntary, 9 involuntary
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519086|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519087|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519088|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 11 16:17:59 cnifthc02cmp004 dhclient[742515]: DHCPDISCOVER on enp7s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x1fa5a1b6)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519089|timeval|WARN|Unreasonably long 5054ms poll interval (4739ms user, 314ms system)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519090|timeval|WARN|faults: 313461 minor, 0 major
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519091|timeval|WARN|context switches: 0 voluntary, 4 involuntary
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519092|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519093|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
..
..
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561264|timeval|WARN|Unreasonably long 4598ms poll interval (4585ms user, 13ms system)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561265|timeval|WARN|context switches: 0 voluntary, 1 involuntary
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561266|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561267|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:58120<->192.168.105.251:6642) at lib/stream-fd.c:157 (100% CPU usage)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561268|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561269|pinctrl|INFO|DHCPACK fa:16:3e:40:d5:92 10.11.151.184
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561270|timeval|WARN|Unreasonably long 4713ms poll interval (4713ms user, 0ms system)
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561271|timeval|WARN|faults: 52678 minor, 0 major
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561272|timeval|WARN|context switches: 0 voluntary, 3 involuntary
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561273|pinctrl|INFO|Dropped 1 log messages in last 4 seconds (most recently, 4 seconds ago) due to excessive rate
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561274|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561275|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32


These are the numbers from today, when the customer hit the issue again:

[root@cnifthc02ctl001 heat-admin]# docker exec -itu root bde0cda6a714 /bin/bash
()[root@ctl001 /]# ovn-nbctl list ACL |grep uuid|wc -l
5051
()[root@ctl001 /]# ovn-nbctl list Logical_Switch | grep uuid | wc -l
62
()[root@ctl001 /]# ovn-nbctl list Logical_Router | grep uuid | wc -l
20
()[root@ctl001 /]# ovn-nbctl list Logical_Switch_Port | grep uuid | wc -l
501
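
For convenience, the same northbound object counts can be gathered in one pass with a small script (a sketch for illustration only; it assumes ovn-nbctl can reach the NB DB from where it runs, e.g. inside the same container as above):

# Hypothetical helper: count NB DB rows per table, equivalent to
# "ovn-nbctl list <table> | grep uuid | wc -l" above.
import subprocess

for table in ("ACL", "Logical_Switch", "Logical_Router", "Logical_Switch_Port"):
    out = subprocess.check_output(["ovn-nbctl", "list", table]).decode()
    rows = sum(1 for line in out.splitlines() if line.startswith("_uuid"))
    print(table, rows)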

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
This is the second time it has happened.

Steps to Reproduce:
1. Unknown; exact reproduction steps have not been identified.

Actual results:
High CPU usage from ovn-controller.
Large delay before security groups are applied.

Expected results:
Normal CPU usage; security groups applied without delay.

Additional info:

Comment 2 haidong li 2019-04-12 05:36:42 UTC
Reproduced on the old version:

Run the following script:

from scapy.all import *

# Broadcast DHCPDISCOVER frame injected on eth1.
pkt = (Ether(src="00:de:ad:01:00:01", dst="ff:ff:ff:ff:ff:ff") /
       IP(src="0.0.0.0", dst="255.255.255.255") /
       UDP(sport=68, dport=67) /
       BOOTP(chaddr="00:de:ad:01:00:01", ciaddr="0.0.0.0") /
       DHCP(options=[("message-type", "discover")]) /
       DHCP(options=[("subnet_mask", "255.255.255.0")]) /
       DHCP(options=[("router", "0.0.0.0")]) /
       DHCP(options=[("name_server", "0.0.0.0"), "end"]))
sendp(pkt, iface="eth1")
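
If a single packet is not enough to keep ovn-controller visibly busy, the same frame can be resent in a loop using scapy's count/inter arguments; this is an assumption about how to sustain the trigger, not part of the QA script above:

# Assumption: replay the DHCPDISCOVER 100 times, 0.5 s apart, to keep the load observable.
sendp(pkt, iface="eth1", count=100, inter=0.5)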

[root@hp-dl380pg8-04 opt]# 
[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.9.0-101.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch-ovn-common-2.9.0-101.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-101.el7fdp.x86_64
openvswitch-2.9.0-101.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch

top - 23:20:57 up 3 days, 2 min,  4 users,  load average: 0.43, 0.26, 0.20
Tasks: 313 total,   3 running, 310 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.5 us,  0.2 sy,  0.0 ni, 94.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26465500 free,  1545444 used,  4791372 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30479248 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
55952 root      10 -10   73684  22980   1372 R 100.0  0.1   4:32.46 ovn-contro+ 
56363 qemu      20   0 2926928 774936  10232 S  44.5  2.4   1:21.13 qemu-kvm    
55883 openvsw+  10 -10 1921136 225832  13600 S   5.3  0.7   0:15.96 ovs-vswitc+ 
15111 root      20   0  151572   6748   4340 S   1.0  0.0   0:05.10 sshd        
55081 root      20   0 1546968  24168  14524 S   1.0  0.1   0:01.79 libvirtd    
 5750 root      20   0  347080   6876   5164 S   0.7  0.0   0:00.39 virsh       
56551 root      20   0       0      0      0 S   0.7  0.0   0:00.18 vhost-56363 
   10 root      20   0       0      0      0 S   0.3  0.0   3:52.26 rcu_sched   
 5869 root      20   0  162144   2460   1592 R   0.3  0.0   0:00.05 top         
17379 root      20   0  169616  26568   5424 S   0.3  0.1   0:27.75 beah-rhts-+ 
    1 root      20   0  195252   8424   4220 S   0.0  0.0   0:32.70 systemd     
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd    
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.39 ksoftirqd/0 
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+ 
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.94 migration/0 
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh      
   11 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+ 

[root@hp-dl380pg8-04 opt]# cat /var/log/openvswitch/ovn-controller.log | grep CPU
2019-04-12T03:20:14.019Z|00155|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (75% CPU usage)
2019-04-12T03:20:18.479Z|00161|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (103% CPU usage)
2019-04-12T03:20:24.500Z|00167|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (100% CPU usage)
2019-04-12T03:20:50.212Z|00180|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.426Z|00181|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.639Z|00182|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.851Z|00183|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:54.463Z|00189|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:21:00.523Z|00195|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (101% CPU usage)
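
Instead of eyeballing the grep output, the reported percentages can be pulled out of the log with a short script (a sketch added for illustration; it only parses the "(N% CPU usage)" suffix that poll_loop prints):

# Hypothetical helper: extract poll_loop's "(N% CPU usage)" figures and report the peak.
import re

pattern = re.compile(r"\((\d+)% CPU usage\)")
peak = 0
with open("/var/log/openvswitch/ovn-controller.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            peak = max(peak, int(match.group(1)))
print("peak reported CPU usage: %d%%" % peak)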

=====================================================================================================================================
Verified on the latest version:

[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.9.0-104.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch-ovn-common-2.9.0-104.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-104.el7fdp.x86_64
openvswitch-2.9.0-104.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch

top - 01:30:08 up 3 days,  2:11,  4 users,  load average: 0.33, 0.09, 0.13
Tasks: 313 total,   1 running, 312 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.2 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26291976 free,  1646568 used,  4863772 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30367344 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
 8175 qemu      20   0 3081616 879096  10236 S  15.3  2.7   1:43.01 qemu-kvm    
 7695 openvsw+  10 -10 1921136 226484  13600 S   2.0  0.7   0:41.80 ovs-vswitc+ 
 3751 root      20   0       0      0      0 S   0.3  0.0   0:17.86 xfsaild/dm+ 
 6725 root      20   0 1546968  24152  14524 S   0.3  0.1   0:01.65 libvirtd    
 7645 openvsw+  10 -10   55016   2764   1608 S   0.3  0.0   0:00.56 ovsdb-serv+ 
15111 root      20   0  151572   6748   4340 S   0.3  0.0   0:05.65 sshd        
20038 root      20   0  347080   6876   5164 S   1.0  0.0   0:00.30 virsh       
   10 root      20   0       0      0      0 S   0.7  0.0   4:05.17 rcu_sched   
  143 root      20   0       0      0      0 S   0.3  0.0   0:02.72 kworker/7:1 
 7733 root      20   0   60564  10624   1716 S   0.3  0.0   0:02.41 ovsdb-serv+ 
    1 root      20   0  195252   8424   4220 S   0.0  0.0   0:36.97 systemd     
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd    
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.42 ksoftirqd/0 
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+ 
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.97 migration/0 
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh      
   11 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+

Comment 4 errata-xmlrpc 2019-04-29 09:26:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0898