Bug 1835550 - ovirt-provider-ovn takes more than ExternalNetworkProviderTimeout to respond to engine requests
Summary: ovirt-provider-ovn takes more than ExternalNetworkProviderTimeout to respond ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-provider-ovn
Version: 4.3.7
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ovirt-4.4.3
: 4.4.3
Assignee: Dominik Holler
QA Contact: michal
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-05-14 03:06 UTC by Germano Veit Michel
Modified: 2023-10-06 20:01 UTC (History)
6 users (show)

Fixed In Version: ovirt-provider-ovn-1.2.31-1
Doc Type: Bug Fix
Doc Text:
Previously, when the RHV Manager requested a listing of available ports from the ovirt-provider-ovn, the implementation was not optimized for scaling scenarios. As a result, in scenarios with many active OVN vNICs on virtual machines, starting a virtual machine using OVN vNICs was slow and sometimes timed out. In this release, implementation of listing ports has been optimized for scaling, as starting a VM with OVN vNICs with many active OVN vNICs is quicker.
Clone Of:
Environment:
Last Closed: 2020-11-24 13:10:38 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5068911 0 None None None 2020-05-14 03:07:38 UTC
Red Hat Product Errata RHBA-2020:5212 0 None None None 2020-11-24 13:10:45 UTC
oVirt gerrit 110225 0 master MERGED Improve scaling of port listing 2020-11-27 15:42:27 UTC
oVirt gerrit 110879 0 master MERGED Get reference to dhcp directly from lsp 2020-11-27 15:42:28 UTC

Description Germano Veit Michel 2020-05-14 03:06:24 UTC
Description of problem:

Virtual Machines and provider test connectivity were failing due to the engine timing out before the ovirt-provider-ovn replying back. The timeout is set by default to ExternalNetworkProviderTimeout=30s 

See for example the engine failing to start a VM:

2020-05-12 19:58:25,657-06 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-268) [ada943f3-39dc-4716-bbe4-dc5af5bf0389] Lock Acquired to object 'EngineLock:{exclusiveLocks='[6dce35b8-90c1-422c-a73d-010fe69048ac=VM]', sharedLocks=''}'
[...]
2020-05-12 19:58:56,106-06 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (default task-268) [ada943f3-39dc-4716-bbe4-dc5af5bf0389] Command 'org.ovirt.engine.core.bll.RunVmCommand' failed: EngineException: (Failed with error Read timed out and code 5050)

Notes:
* It takes 35s to do a GET on :9696/v2.0/ports.
* The number of ports is slightly under 1000.
* After setting the timeout in engine to 90s, all the VMs started.

Also, apparently this traceback in the provider is caused by the engine timing out and closing the socket before it is able to reply back with the answer:

May 12 19:59:14 engine.example.com python2: ::ffff:192.168.0.1 - - [12/May/2020 19:59:14] "GET /v2.0/ports HTTP/1.1" 500 -
May 12 19:59:14 engine.example.com python2: Traceback (most recent call last):
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 295, in _handle_request_noblock
May 12 19:59:14 engine.example.com python2: self.process_request(request, client_address)
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 321, in process_request
May 12 19:59:14 engine.example.com python2: self.finish_request(request, client_address)
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
May 12 19:59:14 engine.example.com python2: self.RequestHandlerClass(request, client_address, self)
May 12 19:59:14 engine.example.com python2: File "/usr/share/ovirt-provider-ovn/handlers/base_handler.py", line 96, in __init__
May 12 19:59:14 engine.example.com python2: self._run_server(request, client_address, server)
May 12 19:59:14 engine.example.com python2: File "/usr/share/ovirt-provider-ovn/handlers/base_handler.py", line 99, in _run_server
May 12 19:59:14 engine.example.com python2: BaseHTTPRequestHandler.__init__(self, request, client_address, server)
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 651, in __init__
May 12 19:59:14 engine.example.com python2: self.finish()
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 710, in finish
May 12 19:59:14 engine.example.com python2: self.wfile.close()
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/socket.py", line 279, in close
May 12 19:59:14 engine.example.com python2: self.flush()
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/socket.py", line 303, in flush
May 12 19:59:14 engine.example.com python2: self._sock.sendall(view[write_offset:write_offset+buffer_size])
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/ssl.py", line 744, in sendall
May 12 19:59:14 engine.example.com python2: v = self.send(data[count:])
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/ssl.py", line 710, in send
May 12 19:59:14 engine.example.com python2: v = self._sslobj.write(data)
May 12 19:59:14 engine.example.com python2: error: [Errno 32] Broken pipe

Version-Release number of selected component (if applicable):
rhvm-4.3.7.2-0.1.el7.noarch
ovirt-provider-ovn-1.2.27-1.el7ev.noarch
ovn2.11-2.11.0-16.el7fdp.x86_64
rhv-openvswitch-ovn-common-2.11-5.el7ev.noarch

How reproducible:
Unable due to scale

Actual results:
OVN provider taking longer times to return responses

Comment 2 Germano Veit Michel 2020-05-14 03:33:57 UTC
If I understand the code correctly, a GET on /ports will make the provider will:

For each port in "ovn-nbctl lsp-list":
    get_dhcp 
    get_lrp 
    list_ls

I couldn't find much debug to be turned on in the provider relating to the ovn_connection execute method to get a better idea on what exactly is slow.
I'm not sure if its the provider or the NB.

Also have been unsucessful in manually adding ports to attempt to reproduce without a large environment, OVN does not pick them up.

Do we know with how many ports we test this?

Comment 10 michal 2020-09-10 14:06:28 UTC
verified in build : 4.4.3.1-0.7

Comment 11 michal 2020-09-10 14:29:45 UTC
verified on ovn :ovirt-provider-ovn-1.2.31-1.el8ev.noarch

Comment 19 errata-xmlrpc 2020-11-24 13:10:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.3]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5212


Note You need to log in before you can comment on or make changes to this bug.