Bug 1797407 - Instance boots fail in a large-scale environment due to VIF plugging timeouts - ovn-controller high CPU usage
Summary: Instance boots fail in a large-scale environment due to VIF plugging timeouts...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: OSP Team
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1797685
Blocks:
 
Reported: 2020-02-03 03:41 UTC by Sai Sindhur Malleni
Modified: 2022-10-03 16:42 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-30 14:47:20 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-529 0 None None None 2022-10-03 16:42:20 UTC

Description Sai Sindhur Malleni 2020-02-03 03:41:14 UTC
Description of problem: We have a 3 controller + 3 ceph + 290 compute node bare-metal environment for scale testing. When we run the rally scenario that boots VMs on a network (https://github.com/cloud-bulldozer/browbeat/blob/master/rally/nova/nova-boot-list-cc.yml), some VM boots take exceptionally long and eventually fail due to VIF plugging timeouts.
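For context, the linked browbeat scenario boots and lists servers on a per-tenant network. A minimal, hand-written rally task in the same spirit (the scenario name, flavor/image names, and context values here are illustrative assumptions, not a copy of the linked file) looks roughly like this:

---
  NovaServers.boot_and_list_server:
    -
      args:
        flavor:
          name: "m1.small"        # assumed flavor name
        image:
          name: "centos"          # assumed image name
        detailed: true
      runner:
        type: "constant"
        times: 1000               # total VMs, as in the test
        concurrency: 16           # rally concurrency, as in the test
      context:
        users:
          tenants: 1              # assumed tenant layout
          users_per_tenant: 1
        network:
          networks_per_tenant: 1  # gives each server a tenant network to attach to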

Two scenarios were tested:
Create 1000 CentOS VMs without network at a rally concurrency of 16

Median time for nova.boot_server = 10s
95% time for nova.boot_server = 21s

Create 1000 CentOS VMs with network at a rally concurrency of 16
Median time for nova.boot_server = 34s
95% time for nova.boot_server = 243s

As we can see, booting a server with a network takes significantly longer than booting one without, which is understandable given the additional processing and signalling involved in VIF plugging, but the 95th-percentile boot time for VMs with a network is *extremely* high at 243s.

Also, out of the 1000 VMs, 30 went into ERROR because they hit the vif_plugging_timeout of 300s.
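
For reference, the timeout in question is governed by two nova-compute options. The values below are the upstream defaults, shown only as a sketch of where the 300s comes from; the exact configuration of this environment is assumed, not confirmed:

# /etc/nova/nova.conf on the compute nodes (defaults shown)
[DEFAULT]
# how long nova-compute waits for Neutron's network-vif-plugged event
vif_plugging_timeout = 300
# if True, instances that never receive the event are put into ERROR
vif_plugging_is_fatal = True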

=============================================================================
On the compute nodes with this issue, the logs show 100% CPU usage by ovn-controller and messages like:
2020-02-03T00:02:35.918Z|03631|timeval|WARN|faults: 2021939 minor, 0 major
2020-02-03T00:02:35.918Z|03632|timeval|WARN|context switches: 0 voluntary, 21 involuntary
2020-02-03T00:02:35.918Z|03633|coverage|INFO|Dropped 2 log messages in last 41 seconds (most recently, 20 seconds ago) due to excessive rate
2020-02-03T00:02:35.918Z|03634|coverage|INFO|Skipping details of duplicate event coverage for hash=8244bb6c
2020-02-03T00:02:35.918Z|03635|poll_loop|INFO|wakeup due to 0-ms timeout at tcp:172.25.1.136:6642 (100% CPU usage)
2020-02-03T00:02:56.188Z|03636|timeval|WARN|Unreasonably long 20269ms poll interval (17152ms user, 3047ms system)
2020-02-03T00:02:56.188Z|03637|timeval|WARN|faults: 2085525 minor, 0 major
2020-02-03T00:02:56.188Z|03638|timeval|WARN|context switches: 0 voluntary, 23 involuntary
2020-02-03T00:02:56.188Z|03639|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-02-03T00:03:17.002Z|03640|timeval|WARN|Unreasonably long 20814ms poll interval (17689ms user, 3052ms system)
2020-02-03T00:03:17.002Z|03641|timeval|WARN|faults: 2143100 minor, 0 major
2020-02-03T00:03:17.002Z|03642|timeval|WARN|context switches: 2 voluntary, 22 involuntary
2020-02-03T00:03:17.002Z|03643|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2020-02-03T00:03:37.157Z|03644|timeval|WARN|Unreasonably long 20155ms poll interval (17127ms user, 2960ms system)
2020-02-03T00:03:37.157Z|03645|timeval|WARN|faults: 2021933 minor, 0 major
2020-02-03T00:03:37.157Z|03646|timeval|WARN|context switches: 0 voluntary, 23 involuntary
2020-02-03T00:03:37.157Z|03647|coverage|INFO|Dropped 2 log messages in last 41 seconds (most recently, 20 seconds ago) due to excessive rate
2020-02-03T00:03:37.157Z|03648|coverage|INFO|Skipping details of duplicate event coverage for hash=8244bb6c
2020-02-03T00:03:37.157Z|03649|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.25.1.208:58946<->172.25.1.136:6642) at lib/stream-fd.c:157 (99% CPU usage)
2020-02-03T00:03:37.275Z|03650|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.383Z|03651|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.491Z|03652|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.599Z|03653|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.707Z|03654|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.815Z|03655|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:37.923Z|03656|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.030Z|03657|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.137Z|03658|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.246Z|03659|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.353Z|03660|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.462Z|03661|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.569Z|03662|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:38.677Z|03663|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:03:43.733Z|03664|poll_loop|INFO|Dropped 46 log messages in last 5 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:03:43.734Z|03665|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (100% CPU usage)
2020-02-03T00:03:49.763Z|03666|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:03:49.763Z|03667|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (100% CPU usage)
2020-02-03T00:03:55.771Z|03668|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:03:55.771Z|03669|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:04:01.782Z|03670|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:04:01.782Z|03671|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:04:07.782Z|03672|poll_loop|INFO|Dropped 55 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-02-03T00:04:07.782Z|03673|poll_loop|INFO|wakeup due to 0-ms timeout at ovn/controller/ovn-controller.c:2057 (99% CPU usage)
2020-02-03T00:04:10.956Z|03674|binding|INFO|Claiming lport 43d8d54f-30a2-4690-9fa7-23e059c19632 for this chassis.
2020-02-03T00:04:10.956Z|03675|binding|INFO|43d8d54f-30a2-4690-9fa7-23e059c19632: Claiming fa:16:3e:34:f2:bc 10.0.5.69
2020-02-03T00:04:33.006Z|03676|timeval|WARN|Unreasonably long 22054ms poll interval (18407ms user, 3575ms system)
2020-02-03T00:04:33.006Z|03677|timeval|WARN|faults: 2721003 minor, 0 major
2020-02-03T00:04:33.006Z|03678|timeval|WARN|disk: 0 reads, 8 writes
======================================================================


The compute node has about 50 flows in the datapath when this happens.
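
For anyone reproducing this, the CPU and flow observations above can be checked with standard OVS/OVN tooling, roughly as follows (a sketch, not an exact transcript from this environment):

# CPU usage of the ovn-controller process on the compute node
top -b -n 1 -p "$(pidof ovn-controller)"

# number of flows currently installed in the kernel datapath
ovs-appctl dpctl/dump-flows | wc -l

# internal event counters of ovn-controller (same module that emits the
# "coverage" log messages shown above)
ovs-appctl -t ovn-controller coverage/show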

Version-Release number of selected component (if applicable):
OSP 16
ovs_version: "2.11.0"
()[root@overcloud-utn1029compute-5 /]# rpm -qa | grep ovn
rhosp-openvswitch-ovn-common-2.11-0.5.el8ost.noarch
ovn2.11-host-2.11.1-8.el8fdp.x86_64
puppet-ovn-15.4.1-0.20191014133046.192ac4e.el8ost.noarch
ovn2.11-2.11.1-8.el8fdp.x86_64
rhosp-openvswitch-ovn-host-2.11-0.5.el8ost.noarch

How reproducible:
100% when booting a large number of VMs in a large-scale environment

Steps to Reproduce:
1. Deploy a large OSP environment with OVN
2. Boot VMs with network at scale (a rough example follows below)
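
A minimal sketch of step 2 using the plain openstack CLI (flavor, image, and network names are placeholders; the actual test drove this through the rally scenario linked in the description, so the concurrency here is only an approximation):

# boot 1000 VMs attached to a tenant network, roughly 16 at a time
for i in $(seq 1 1000); do
  openstack server create \
    --flavor m1.small \
    --image centos \
    --network testnet \
    "scale-test-vm-${i}" &
  # wait for each batch of 16 before launching the next one
  if (( i % 16 == 0 )); then wait; fi
done
wait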

Actual results:
Some VM boots take several minutes; 30 out of 1000 VMs go into ERROR after hitting the 300s VIF plugging timeout, and ovn-controller consumes ~100% of a CPU core on the affected compute nodes.

Expected results:
All VMs boot successfully, with VIF plugging completing well within the timeout.

Additional info:
The environment has patches for 
https://bugzilla.redhat.com/show_bug.cgi?id=1793440 and https://bugzilla.redhat.com/show_bug.cgi?id=1795198 applied

Comment 1 Sai Sindhur Malleni 2020-02-03 03:42:49 UTC
Correction to the bug description:
Two scenarios were tested
Create 1000 CentOS VMs WITHOUT network at a rally concurrency of 16

Median time for nova.boot_server = 10s
95% time for nova.boot_server = 21s

Create 1000 CentOS VMs WITH network at a rally concurrency of 16
Median time for nova.boot_server = 34s
95% time for nova.boot_server = 243s

