Previously, when the RHV Manager requested a listing of available ports from the ovirt-provider-ovn, the implementation was not optimized for scaling scenarios. As a result, in scenarios with many active OVN vNICs on virtual machines, starting a virtual machine using OVN vNICs was slow and sometimes timed out.
In this release, the implementation of port listing has been optimized for scale, so starting a virtual machine with OVN vNICs is quicker even when many OVN vNICs are active.
Description  Germano Veit Michel
2020-05-14 03:06:24 UTC
Description of problem:
Virtual machines and the provider's test connectivity were failing because the engine timed out before ovirt-provider-ovn replied. The timeout is set by default to ExternalNetworkProviderTimeout=30s.
See for example the engine failing to start a VM:
2020-05-12 19:58:25,657-06 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-268) [ada943f3-39dc-4716-bbe4-dc5af5bf0389] Lock Acquired to object 'EngineLock:{exclusiveLocks='[6dce35b8-90c1-422c-a73d-010fe69048ac=VM]', sharedLocks=''}'
[...]
2020-05-12 19:58:56,106-06 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (default task-268) [ada943f3-39dc-4716-bbe4-dc5af5bf0389] Command 'org.ovirt.engine.core.bll.RunVmCommand' failed: EngineException: (Failed with error Read timed out and code 5050)
Notes:
* It takes 35s to do a GET on :9696/v2.0/ports.
* The number of ports is slightly under 1000.
* After setting the timeout in engine to 90s, all the VMs started.
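For reference, the workaround above can be applied with engine-config on the engine host; the key name comes from the description, while the exact value (90s, matching the note above) and the restart step are assumptions about a typical setup:

```shell
# Raise the engine's timeout for the external network provider
# (default is 30s per the description; 90s was enough in this case).
engine-config -s ExternalNetworkProviderTimeout=90

# The engine must be restarted for the new value to take effect.
systemctl restart ovirt-engine
```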
Also, this traceback in the provider appears to be caused by the engine timing out and closing the socket before the provider can write back its reply:
May 12 19:59:14 engine.example.com python2: ::ffff:192.168.0.1 - - [12/May/2020 19:59:14] "GET /v2.0/ports HTTP/1.1" 500 -
May 12 19:59:14 engine.example.com python2: Traceback (most recent call last):
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 295, in _handle_request_noblock
May 12 19:59:14 engine.example.com python2: self.process_request(request, client_address)
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 321, in process_request
May 12 19:59:14 engine.example.com python2: self.finish_request(request, client_address)
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request
May 12 19:59:14 engine.example.com python2: self.RequestHandlerClass(request, client_address, self)
May 12 19:59:14 engine.example.com python2: File "/usr/share/ovirt-provider-ovn/handlers/base_handler.py", line 96, in __init__
May 12 19:59:14 engine.example.com python2: self._run_server(request, client_address, server)
May 12 19:59:14 engine.example.com python2: File "/usr/share/ovirt-provider-ovn/handlers/base_handler.py", line 99, in _run_server
May 12 19:59:14 engine.example.com python2: BaseHTTPRequestHandler.__init__(self, request, client_address, server)
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 651, in __init__
May 12 19:59:14 engine.example.com python2: self.finish()
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 710, in finish
May 12 19:59:14 engine.example.com python2: self.wfile.close()
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/socket.py", line 279, in close
May 12 19:59:14 engine.example.com python2: self.flush()
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/socket.py", line 303, in flush
May 12 19:59:14 engine.example.com python2: self._sock.sendall(view[write_offset:write_offset+buffer_size])
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/ssl.py", line 744, in sendall
May 12 19:59:14 engine.example.com python2: v = self.send(data[count:])
May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/ssl.py", line 710, in send
May 12 19:59:14 engine.example.com python2: v = self._sslobj.write(data)
May 12 19:59:14 engine.example.com python2: error: [Errno 32] Broken pipe
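The "Broken pipe" at the bottom of that traceback is the provider flushing its response into a socket the engine has already closed. A minimal sketch of that failure mode (using a local socket pair to stand in for the engine/provider connection, not the provider's actual code):

```python
import socket

# Simulate the scenario from the traceback: the reader (engine) gives up
# and closes its end of the connection, then the writer (provider) tries
# to flush its HTTP response into the dead socket.
reader, writer = socket.socketpair()
reader.close()  # engine times out and closes the connection

try:
    writer.sendall(b"HTTP/1.1 200 OK\r\n\r\n")
    result = "sent"
except BrokenPipeError:  # errno 32, matching "error: [Errno 32] Broken pipe"
    result = "broken pipe"
finally:
    writer.close()

print(result)
```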
Version-Release number of selected component (if applicable):
rhvm-4.3.7.2-0.1.el7.noarch
ovirt-provider-ovn-1.2.27-1.el7ev.noarch
ovn2.11-2.11.0-16.el7fdp.x86_64
rhv-openvswitch-ovn-common-2.11-5.el7ev.noarch
How reproducible:
Unable due to scale
Actual results:
OVN provider taking longer times to return responses
Comment 2  Germano Veit Michel
2020-05-14 03:33:57 UTC
If I understand the code correctly, a GET on /ports makes the provider do the following:
For each port in "ovn-nbctl lsp-list":
get_dhcp
get_lrp
list_ls
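If that reading is right, the cost is a classic N+1 query pattern: several northbound lookups per port instead of a handful of bulk reads. A hypothetical sketch of the difference (the counter stands in for ovn-nbctl/NB round-trips; this is not the provider's actual code):

```python
# Illustrate why per-port lookups scale poorly: round-trips grow linearly
# with the number of ports, while bulk reads stay constant.
calls = {"n": 0}

def query_db(table):
    """Stand-in for a single northbound DB round-trip."""
    calls["n"] += 1
    return {}

def list_ports_per_port(ports):
    # One get_dhcp / get_lrp / list_ls style lookup per port:
    # 3 * len(ports) round-trips.
    return [(query_db("dhcp"), query_db("lrp"), query_db("ls"))
            for _ in ports]

def list_ports_bulk(ports):
    # Fetch each table once up front and join in memory: 3 round-trips.
    dhcp, lrp, ls = query_db("dhcp"), query_db("lrp"), query_db("ls")
    return [(dhcp, lrp, ls) for _ in ports]

ports = range(1000)  # "slightly under 1000" ports, per the description

calls["n"] = 0
list_ports_per_port(ports)
per_port_trips = calls["n"]

calls["n"] = 0
list_ports_bulk(ports)
bulk_trips = calls["n"]

print(per_port_trips, bulk_trips)
```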
I couldn't find much debugging to turn on in the provider around the ovn_connection execute method to get a better idea of what exactly is slow.
I'm not sure if it's the provider or the NB.
I have also been unsuccessful in manually adding ports to reproduce this without a large environment; OVN does not pick them up.
Do we know with how many ports we test this?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.3]), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:5212