This bug was initially created as a copy of Bug #1695722

I am copying this bug because:

This bug was initially created as a copy of Bug #1677616

I am copying this bug because:

Description of problem:

The customer is seeing a delay before security groups are applied:

"""
There is a delay in applying security groups on instances created by a heat template. Thus there is no connection between instances. After some time (about 30 min) it starts working. Investigation revealed high CPU usage by the ovn-controller.
"""

I found there are a lot of flows on br-int (Feb 12):

$ wc -l ovs-ofctl_-O_OpenFlow13_dump-flows_br-int
329665 ovs-ofctl_-O_OpenFlow13_dump-flows_br-int

This can be found in the journal:

Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519076|timeval|WARN|Unreasonably long 5381ms poll interval (4889ms user, 492ms system)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519077|timeval|WARN|faults: 571166 minor, 0 major
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519078|timeval|WARN|context switches: 0 voluntary, 7 involuntary
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519079|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519080|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519081|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:51 cnifthc02cmp004 journal: 2019-02-11T21:17:51Z|1519082|pinctrl|INFO|DHCPACK fa:16:3e:89:44:2a 10.11.154.160
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519083|timeval|WARN|Unreasonably long 5252ms poll interval (4903ms user, 346ms system)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519084|timeval|WARN|faults: 403610 minor, 0 major
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519085|timeval|WARN|context switches: 0 voluntary, 9 involuntary
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519086|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519087|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
Feb 11 16:17:56 cnifthc02cmp004 journal: 2019-02-11T21:17:56Z|1519088|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 11 16:17:59 cnifthc02cmp004 dhclient[742515]: DHCPDISCOVER on enp7s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x1fa5a1b6)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519089|timeval|WARN|Unreasonably long 5054ms poll interval (4739ms user, 314ms system)
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519090|timeval|WARN|faults: 313461 minor, 0 major
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519091|timeval|WARN|context switches: 0 voluntary, 4 involuntary
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519092|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 11 16:18:01 cnifthc02cmp004 journal: 2019-02-11T21:18:01Z|1519093|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:43410<->192.168.105.251:6642) at lib/stream-fd.c:157 (101% CPU usage)
..
..
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561264|timeval|WARN|Unreasonably long 4598ms poll interval (4585ms user, 13ms system)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561265|timeval|WARN|context switches: 0 voluntary, 1 involuntary
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561266|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561267|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (192.168.105.14:58120<->192.168.105.251:6642) at lib/stream-fd.c:157 (100% CPU usage)
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561268|pinctrl|INFO|Dropped 2 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
Feb 12 01:43:17 cnifthc02cmp004 journal: 2019-02-12T06:43:17Z|1561269|pinctrl|INFO|DHCPACK fa:16:3e:40:d5:92 10.11.151.184
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561270|timeval|WARN|Unreasonably long 4713ms poll interval (4713ms user, 0ms system)
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561271|timeval|WARN|faults: 52678 minor, 0 major
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561272|timeval|WARN|context switches: 0 voluntary, 3 involuntary
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561273|pinctrl|INFO|Dropped 1 log messages in last 4 seconds (most recently, 4 seconds ago) due to excessive rate
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561274|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32
Feb 12 01:43:22 cnifthc02cmp004 journal: 2019-02-12T06:43:22Z|1561275|pinctrl|INFO|DHCPACK fa:16:3e:e5:df:b6 10.11.63.32

These are numbers from today, when the customer faced the issue again:

[root@cnifthc02ctl001 heat-admin]# docker exec -itu root bde0cda6a714 /bin/bash
()[root@ctl001 /]# ovn-nbctl list ACL | grep uuid | wc -l
5051
()[root@ctl001 /]# ovn-nbctl list Logical_Switch | grep uuid | wc -l
62
()[root@ctl001 /]# ovn-nbctl list Logical_Router | grep uuid | wc -l
20
()[root@ctl001 /]# ovn-nbctl list Logical_Switch_Port | grep uuid | wc -l
501

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
This is the second time it has happened.

Steps to Reproduce:
1. Unsure
2.
3.

Actual results:
High ovn-controller CPU usage; a huge delay before security groups are applied.

Expected results:
None of the actual results.

Additional info:
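For reference, the figures quoted above (the br-int flow count and the NB database object counts) can be gathered in one pass. The snippet below is only a sketch, not part of the original report: it assumes ovs-ofctl and ovn-nbctl are reachable from where it runs, which on an OSP13 node usually means prefixing each command with "docker exec" into the relevant container, as in the session above.

#!/usr/bin/env python
# Hypothetical helper (not from the report): collect the same figures
# quoted above in one pass. Assumes ovs-ofctl and ovn-nbctl are on PATH;
# adjust the commands if they must be wrapped in "docker exec ...".
import subprocess

def count_lines(cmd):
    # Run a shell command and return how many lines it printed.
    # Note: grep exits non-zero (raising CalledProcessError) if a table is empty.
    return len(subprocess.check_output(cmd, shell=True).splitlines())

print("br-int flows        : %d" %
      count_lines("ovs-ofctl -O OpenFlow13 dump-flows br-int"))
for table in ("ACL", "Logical_Switch", "Logical_Router", "Logical_Switch_Port"):
    print("%-20s: %d" %
          (table, count_lines("ovn-nbctl list %s | grep uuid" % table)))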
Reproduced on the old version by running the following script (sends a broadcast DHCPDISCOVER with scapy):

from scapy.all import *

sendp(Ether(src="00:de:ad:01:00:01", dst="ff:ff:ff:ff:ff:ff")
      /IP(src="0.0.0.0", dst="255.255.255.255")
      /UDP(sport=68, dport=67)
      /BOOTP(chaddr="00:de:ad:01:00:01", ciaddr="0.0.0.0")
      /DHCP(options=[("message-type", "discover")])
      /DHCP(options=[("subnet_mask", "255.255.255.0")])
      /DHCP(options=[("router", "0.0.0.0")])
      /DHCP(options=[("name_server", "0.0.0.0"), "end"]),
      iface="eth1")

[root@hp-dl380pg8-04 opt]#
[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.9.0-101.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch-ovn-common-2.9.0-101.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-101.el7fdp.x86_64
openvswitch-2.9.0-101.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch

top - 23:20:57 up 3 days, 2 min,  4 users,  load average: 0.43, 0.26, 0.20
Tasks: 313 total,   3 running, 310 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.5 us,  0.2 sy,  0.0 ni, 94.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26465500 free,  1545444 used,  4791372 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30479248 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
55952 root      10 -10   73684  22980   1372 R 100.0  0.1   4:32.46 ovn-contro+
56363 qemu      20   0 2926928 774936  10232 S  44.5  2.4   1:21.13 qemu-kvm
55883 openvsw+  10 -10 1921136 225832  13600 S   5.3  0.7   0:15.96 ovs-vswitc+
15111 root      20   0  151572   6748   4340 S   1.0  0.0   0:05.10 sshd
55081 root      20   0 1546968  24168  14524 S   1.0  0.1   0:01.79 libvirtd
 5750 root      20   0  347080   6876   5164 S   0.7  0.0   0:00.39 virsh
56551 root      20   0       0      0      0 S   0.7  0.0   0:00.18 vhost-56363
   10 root      20   0       0      0      0 S   0.3  0.0   3:52.26 rcu_sched
 5869 root      20   0  162144   2460   1592 R   0.3  0.0   0:00.05 top
17379 root      20   0  169616  26568   5424 S   0.3  0.1   0:27.75 beah-rhts-+
    1 root      20   0  195252   8424   4220 S   0.0  0.0   0:32.70 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.39 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.94 migration/0
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
   11 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+

[root@hp-dl380pg8-04 opt]# cat /var/log/openvswitch/ovn-controller.log | grep CPU
2019-04-12T03:20:14.019Z|00155|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (75% CPU usage)
2019-04-12T03:20:18.479Z|00161|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (103% CPU usage)
2019-04-12T03:20:24.500Z|00167|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (100% CPU usage)
2019-04-12T03:20:50.212Z|00180|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.426Z|00181|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.639Z|00182|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:50.851Z|00183|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:20:54.463Z|00189|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (102% CPU usage)
2019-04-12T03:21:00.523Z|00195|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/br-int.mgmt) at ../lib/stream-fd.c:157 (101% CPU usage)

=====================================================================================================================================

Verified on the latest version:

[root@hp-dl380pg8-04 opt]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.9.0-104.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-11.el7fdp.noarch
openvswitch-ovn-common-2.9.0-104.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-104.el7fdp.x86_64
openvswitch-2.9.0-104.el7fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-1.0-121.noarch

top - 01:30:08 up 3 days,  2:11,  4 users,  load average: 0.33, 0.09, 0.13
Tasks: 313 total,   1 running, 312 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.2 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32802316 total, 26291976 free,  1646568 used,  4863772 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 30367344 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8175 qemu      20   0 3081616 879096  10236 S  15.3  2.7   1:43.01 qemu-kvm
 7695 openvsw+  10 -10 1921136 226484  13600 S   2.0  0.7   0:41.80 ovs-vswitc+
 3751 root      20   0       0      0      0 S   0.3  0.0   0:17.86 xfsaild/dm+
 6725 root      20   0 1546968  24152  14524 S   0.3  0.1   0:01.65 libvirtd
 7645 openvsw+  10 -10   55016   2764   1608 S   0.3  0.0   0:00.56 ovsdb-serv+
15111 root      20   0  151572   6748   4340 S   0.3  0.0   0:05.65 sshd
20038 root      20   0  347080   6876   5164 S   1.0  0.0   0:00.30 virsh
   10 root      20   0       0      0      0 S   0.7  0.0   4:05.17 rcu_sched
  143 root      20   0       0      0      0 S   0.3  0.0   0:02.72 kworker/7:1
 7733 root      20   0   60564  10624   1716 S   0.3  0.0   0:02.41 ovsdb-serv+
    1 root      20   0  195252   8424   4220 S   0.0  0.0   0:36.97 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.42 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.97 migration/0
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
   11 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-dr+
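For completeness, the before/after check above boils down to: send the reporter's broadcast DHCPDISCOVER and then watch whether ovn-controller pins a core. The snippet below is only a sketch rather than the procedure actually used for verification. Assumptions: scapy is installed, eth1 is the interface from the reproducer, the four DHCP() layers of the original one-liner are folded into a single options list, and sampling %cpu with ps stands in for the top output shown above.

#!/usr/bin/env python
# Hypothetical sketch (not the verification procedure itself): send the
# reporter's broadcast DHCPDISCOVER, then sample ovn-controller CPU usage
# for about a minute via ps. "eth1" may differ on other hosts.
import subprocess
import time
from scapy.all import Ether, IP, UDP, BOOTP, DHCP, sendp

pkt = (Ether(src="00:de:ad:01:00:01", dst="ff:ff:ff:ff:ff:ff") /
       IP(src="0.0.0.0", dst="255.255.255.255") /
       UDP(sport=68, dport=67) /
       BOOTP(chaddr="00:de:ad:01:00:01", ciaddr="0.0.0.0") /
       DHCP(options=[("message-type", "discover"),
                     ("subnet_mask", "255.255.255.0"),
                     ("router", "0.0.0.0"),
                     ("name_server", "0.0.0.0"),
                     "end"]))
sendp(pkt, iface="eth1")

# Stand-in for the top output above: sample %CPU of ovn-controller.
pid = subprocess.check_output(["pidof", "ovn-controller"]).split()[0].decode()
for _ in range(12):
    cpu = subprocess.check_output(["ps", "-o", "%cpu=", "-p", pid]).decode().strip()
    print("ovn-controller %%cpu: %s" % cpu)
    time.sleep(5)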
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0898