Description of problem:

We have a 3 controller + 3 Ceph + 290 compute node baremetal environment for scale testing. When we run the rally scenario that boots VMs on a network (https://github.com/cloud-bulldozer/browbeat/blob/master/rally/nova/nova-boot-list-cc.yml), some VM boots take exceptionally long and eventually time out due to VIF plugging timeouts.

Two scenarios were tested:

Create 1000 CentOS VMs without network at a rally concurrency of 16
  Median time for nova.boot_server = 10s
  95% time for nova.boot_server = 21s

Create 1000 CentOS VMs with network at a rally concurrency of 16
  Median time for nova.boot_server = 34s
  95% time for nova.boot_server = 243s

As we can see, booting a server with network takes significantly longer than booting one without, which is understandable given the additional processing and signalling involved in VIF plugging, but the 95% boot time for VMs with network is *extremely* high at 243s. Also, out of the 1000 VMs, 30 went into ERROR because they reached the vif_plugging_timeout of 300s.

=============================================================================

On the affected compute nodes, ovn-controller runs at 100% CPU and its log shows messages like:

2020-02-03T00:02:35.918Z|03631|timeval|WARN|faults: 2021939 minor, 0 major
2020-02-03T00:02:35.918Z|03632|timeval|WARN|context switches: 0 voluntary, 21 involuntary
2020-02-03T00:02:35.918Z|03633|coverage|INFO|Dropped 2 log messages in last 41 seconds (most recently, 20 seconds ago) due to excessive rate
2020-02-03T00:02:35.918Z|03634|coverage|INFO|Skipping details of duplicate event coverage for hash=8244bb6c
2020-02-03T00:02:35.918Z|03635|poll_loop|INFO|wakeup due to 0-ms timeout at tcp:172.25.1.136:6642 (100% CPU usage)
2020-02-03T00:02:56.188Z|03636|timeval|WARN|Unreasonably long 20269ms poll interval (17152ms user, 3047ms system)
2020-02-03T00:02:56.188Z|03637|timeval|WARN|faults: 2085525 minor, 0 major
2020-02-03T00:02:56.188Z|03638|timeval|WARN|context switches: 0 voluntary, 23 involuntary
2020-02-03T00:02:56.188Z|03639|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-02-03T00:03:17.002Z|03640|timeval|WARN|Unreasonably long 20814ms poll interval (17689ms user, 3052ms system)
2020-02-03T00:03:17.002Z|03641|timeval|WARN|faults: 2143100 minor, 0 major
2020-02-03T00:03:17.002Z|03642|timeval|WARN|context switches: 2 voluntary, 22 involuntary
2020-02-03T00:03:17.002Z|03643|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-02-03T00:03:37.157Z|03644|timeval|WARN|Unreasonably long 20155ms poll interval (17127ms user, 2960ms system)
2020-02-03T00:03:37.157Z|03645|timeval|WARN|faults: 2021933 minor, 0 major
2020-02-03T00:03:37.157Z|03646|timeval|WARN|context switches: 0 voluntary, 23 involuntary
2020-02-03T00:03:37.157Z|03647|coverage|INFO|Dropped 2 log messages in last 41 seconds (most recently, 20 seconds ago) due to excessive rate
2020-02-03T00:03:37.157Z|03648|coverage|INFO|Skipping details of duplicate event coverage for hash=8244bb6c
2020-02-03T00:03:37.157Z|03649|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.25.1.208:58946<->172.25.1.136:6642) at lib/stream-fd.c:157 (99% CPU usage)
2020-02-03T00:03:37.275Z|03650|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.383Z|03651|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.491Z|03652|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.599Z|03653|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.707Z|03654|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.815Z|03655|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.923Z|03656|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.030Z|03657|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.137Z|03658|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.246Z|03659|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.353Z|03660|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.462Z|03661|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.569Z|03662|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.677Z|03663|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:43.733Z|03664|poll_loop|INFO|Dropped 46 log messages in last 5 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:03:43.734Z|03665|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (100% CPU usage)
2020-02-03T00:03:49.763Z|03666|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:03:49.763Z|03667|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (100% CPU usage)
2020-02-03T00:03:55.771Z|03668|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:03:55.771Z|03669|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:04:01.782Z|03670|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:04:01.782Z|03671|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:04:07.782Z|03672|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:04:07.782Z|03673|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:04:10.956Z|03674|binding|INFO|Claiming lport 43d8d54f-30a2-4690-9fa7-23e059c19632 for this chassis.
2020-02-03T00:04:10.956Z|03675|binding|INFO|43d8d54f-30a2-4690-9fa7-23e059c19632: Claiming fa:16:3e:34:f2:bc 10.0.5.69
2020-02-03T00:04:33.006Z|03676|timeval|WARN|Unreasonably long 22054ms poll interval (18407ms user, 3575ms system)
2020-02-03T00:04:33.006Z|03677|timeval|WARN|faults: 2721003 minor, 0 major
2020-02-03T00:04:33.006Z|03678|timeval|WARN|disk: 0 reads, 8 writes

======================================================================

The compute node has about 50 flows in the datapath when this happens.

Version-Release number of selected component (if applicable):

OSP 16
ovs_version: "2.11.0"

()[root@overcloud-utn1029compute-5 /]# rpm -qa | grep ovn
rhosp-openvswitch-ovn-common-2.11-0.5.el8ost.noarch
ovn2.11-host-2.11.1-8.el8fdp.x86_64
puppet-ovn-15.4.1-0.20191014133046.192ac4e.el8ost.noarch
ovn2.11-2.11.1-8.el8fdp.x86_64
rhosp-openvswitch-ovn-host-2.11-0.5.el8ost.noarch

How reproducible:
100% when booting a large number of VMs on a scale environment

Steps to Reproduce:
1. Deploy a large OSP environment with OVN
2. Boot VMs with network at scale
3. Watch VM boot times and the ovn-controller logs on the compute nodes

Actual results:
VM boots with network are very slow (95% time of 243s), ovn-controller spins at ~100% CPU on the compute nodes, and 30 of the 1000 VMs went into ERROR after hitting the 300s vif_plugging_timeout.

Expected results:
VM boot times with network stay close to the without-network case and no VMs fail with VIF plugging timeouts.

Additional info:
The environment has the patches for https://bugzilla.redhat.com/show_bug.cgi?id=1793440 and https://bugzilla.redhat.com/show_bug.cgi?id=1795198 applied.
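A minimal sketch of commands for re-checking an affected compute node, capturing the data points quoted above (ovn-controller CPU usage, datapath flow count, OpenFlow flow count, frequency of the long poll intervals). It assumes the standard OVS/OVN CLI tools are usable directly on the compute host; the log path is an assumption based on the usual OSP 16 containerized layout and may differ:

# one-shot sample of ovn-controller CPU usage
top -b -n 1 -p "$(pidof ovn-controller)"

# number of datapath (megaflow) entries currently installed
ovs-appctl dpctl/dump-flows | wc -l

# number of OpenFlow flows programmed into br-int by ovn-controller
ovs-ofctl dump-flows br-int | wc -l

# how often ovn-controller reported "Unreasonably long ... poll interval"
# (log path is an assumption; adjust to wherever ovn-controller logs)
grep -c 'Unreasonably long' /var/log/containers/openvswitch/ovn-controller.log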
Correction to the bug description. Two scenarios were tested:

Create 1000 CentOS VMs WITHOUT network at a rally concurrency of 16
  Median time for nova.boot_server = 10s
  95% time for nova.boot_server = 21s

Create 1000 CentOS VMs WITH network at a rally concurrency of 16
  Median time for nova.boot_server = 34s
  95% time for nova.boot_server = 243s
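For reference, the 300s limit that put the 30 VMs into ERROR is nova's vif_plugging_timeout (with vif_plugging_is_fatal controlling whether the instance is failed). A minimal sketch for confirming the values on a compute node; the nova.conf path is an assumption based on the usual OSP 16 containerized nova_libvirt layout and may differ:

# vif_plugging_timeout / vif_plugging_is_fatal as seen by nova-compute
# (config path is an assumption for containerized OSP 16; adjust as needed)
grep -E 'vif_plugging_(timeout|is_fatal)' \
    /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf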