Description of problem:
CPU usage is high after creating a lot of logical ports.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Set up the OVN environment, create another OVS bridge, and set the bridge mapping.
2. Add a lot of logical switches and add logical ports to them. Add a lot of veth ports and set IPv4 and IPv6 addresses on them.
3. Add a localnet port to each logical switch, using a different VLAN tag for external traffic.
4. After the configuration, CPU usage goes high:

[root@dell-per740-04 ~]# top
top - 23:25:17 up 5 days, 2:13, 2 users, load average: 3.36, 3.78, 3.39
Tasks: 565 total, 4 running, 561 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.1 us, 1.3 sy, 0.0 ni, 92.8 id, 0.0 wa, 0.0 hi, 1.8 si, 0.0 st
KiB Mem : 65213648 total, 30859008 free, 7398928 used, 26955712 buff/cache
KiB Swap: 32767996 total, 32767996 free, 0 used. 54367732 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
169046 openvsw+  10 -10 4253016   1.6g  17936 R 207.9  2.6  61361:05 ovs-vswitchd
169120 root      10 -10 1534852   1.2g   1760 R 100.0  1.9   6363:19 ovn-controller
   211 root      20   0       0      0      0 R  99.7  0.0 138:55.56 ksoftirqd/40
  1114 root      20   0  134912  75128  74012 S   0.7  0.1  81:07.19 systemd-journal
     1 root      20   0  216728  29960   4228 S   0.3  0.0   0:50.41 systemd
     9 root      20   0       0      0      0 S   0.3  0.0  28:14.41 rcu_sched
  1617 root      20   0   21928   1640    996 S   0.3  0.0   4:56.83 irqbalance
273925 root      20   0       0      0      0 S   0.3  0.0   0:00.04 kworker/10:1
274063 root      20   0  162452   2760   1580 R   0.3  0.0   0:00.06 top
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.15 kthreadd
     4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:05.00 ksoftirqd/0

Some info from ovs-vswitchd.log:
2019-09-04T03:36:28.190Z|177224|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (FIFO pipe:[24021198]) at ../lib/ovs-rcu.c:235 (99% CPU usage)
2019-09-04T03:36:28.191Z|177225|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (FIFO pipe:[24021198]) at ../lib/ovs-rcu.c:235 (99% CPU usage)
2019-09-04T03:36:29.193Z|92360|ovs_rcu(urcu6)|WARN|blocked 1002 ms waiting for main to quiesce
2019-09-04T03:36:29.214Z|40325|poll_loop(revalidator99)|INFO|wakeup due to 376-ms timeout at ../ofproto/ofproto-dpif-upcall.c:982 (53% CPU usage)
2019-09-04T03:36:29.326Z|40326|poll_loop(revalidator99)|INFO|wakeup due to [POLLIN] on fd 75 (FIFO pipe:[24021531]) at ../lib/ovs-thread.c:311 (53% CPU usage)
2019-09-04T03:36:29.715Z|40327|poll_loop(revalidator99)|INFO|wakeup due to 375-ms timeout at ../ofproto/ofproto-dpif-upcall.c:982 (53% CPU usage)
2019-09-04T03:36:30.192Z|92361|ovs_rcu(urcu6)|WARN|blocked 2001 ms waiting for main to quiesce
2019-09-04T03:36:31.520Z|40328|timeval(revalidator99)|WARN|Unreasonably long 1305ms poll interval (0ms user, 1235ms system)
2019-09-04T03:36:31.520Z|40329|timeval(revalidator99)|WARN|faults: 7 minor, 0 major
2019-09-04T03:36:31.520Z|40330|timeval(revalidator99)|WARN|context switches: 2280 voluntary, 2 involuntary
2019-09-04T03:36:32.191Z|92362|ovs_rcu(urcu6)|WARN|blocked 4000 ms waiting for main to quiesce
2019-09-04T03:36:36.191Z|92363|ovs_rcu(urcu6)|WARN|blocked 8000 ms waiting for main to quiesce
2019-09-04T03:36:44.192Z|92364|ovs_rcu(urcu6)|WARN|blocked 16000 ms waiting for main to quiesce

Actual results:

Expected results:

Additional info:
Use this case to reproduce the issue. The loop needs to be changed from 10*10 to 200*200, so there will be 4000 logical switches and veth ports.
http://pkgs.devel.redhat.com/cgit/tests/kernel/tree/networking/openvswitch/ovn
function name "ovn_multi_vlan"
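For reference, a simplified single-loop sketch of the steps above (the linked test case nests two loops; the switch/port names, addresses, VLAN tags, and the physnet1 mapping name here are illustrative, not taken from the test case):

# assumes OVN is already running and br-int exists; all names/addresses are examples
ovs-vsctl add-br br-ext
ovs-vsctl set open_vswitch . external-ids:ovn-bridge-mappings=physnet1:br-ext

for i in $(seq 1 200); do
    ovn-nbctl ls-add ls$i

    # localnet port with a per-switch VLAN tag for external traffic
    # (depending on OVN version, tag_request may be used instead of tag)
    ovn-nbctl lsp-add ls$i ln$i
    ovn-nbctl lsp-set-type ln$i localnet
    ovn-nbctl lsp-set-addresses ln$i unknown
    ovn-nbctl lsp-set-options ln$i network_name=physnet1
    ovn-nbctl set logical_switch_port ln$i tag=$i

    # logical port bound to a local veth with IPv4 and IPv6 addresses
    ovn-nbctl lsp-add ls$i lsp$i
    ovn-nbctl lsp-set-addresses lsp$i "$(printf '00:00:00:01:01:%02x' $i) 192.168.$i.10 2001:db8:$i::10"
    ip link add veth$i type veth peer name veth${i}_p
    ip addr add 192.168.$i.10/24 dev veth$i
    ip addr add 2001:db8:$i::10/64 dev veth$i
    ip link set veth$i up
    ip link set veth${i}_p up
    ovs-vsctl add-port br-int veth${i}_p -- set interface veth${i}_p external_ids:iface-id=lsp$i
done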
Hi Haidong Li, since the issue is seen with ovs-vswitchd as well, would you mind cloning this bug to the openvswitch component?
I have copied this bug to bz1749840 on the openvswitch2.11 component and to bz1749610 on ovs2.9.
I logged into the setup and looked into it a bit. Something seems wrong with ovs-vswitchd. ovn-controller keeps breaking its OpenFlow connection to ovs-vswitchd and reconnecting, which is why we are seeing high CPU usage in ovn-controller. It looks like we need to investigate ovs-vswitchd and see what is going on there. I don't think this is an OVN issue.
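A few diagnostics that could help confirm this from the host (a sketch; the log path and bridge name assume the default RHEL openvswitch packaging and may differ):

# per-thread CPU of ovs-vswitchd (main thread vs. handler/revalidator threads)
top -H -p $(pidof ovs-vswitchd)

# rough count of OpenFlow (re)connection messages from ovn-controller to br-int.mgmt
grep -c 'br-int.mgmt' /var/log/openvswitch/ovn-controller.log

# size of the OpenFlow table on br-int, which drives ovs-vswitchd main-thread work
ovs-ofctl dump-flows br-int | wc -l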
Can you still reproduce this? If yes, can I have access to the system while reproducing the issue?
The explanation is described in https://bugzilla.redhat.com/show_bug.cgi?id=1749840#c7; I think this bug can be closed.