Description of problem: Virtual Machines and provider test connectivity were failing due to the engine timing out before the ovirt-provider-ovn replying back. The timeout is set by default to ExternalNetworkProviderTimeout=30s See for example the engine failing to start a VM: 2020-05-12 19:58:25,657-06 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-268) [ada943f3-39dc-4716-bbe4-dc5af5bf0389] Lock Acquired to object 'EngineLock:{exclusiveLocks='[6dce35b8-90c1-422c-a73d-010fe69048ac=VM]', sharedLocks=''}' [...] 2020-05-12 19:58:56,106-06 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (default task-268) [ada943f3-39dc-4716-bbe4-dc5af5bf0389] Command 'org.ovirt.engine.core.bll.RunVmCommand' failed: EngineException: (Failed with error Read timed out and code 5050) Notes: * It takes 35s to do a GET on :9696/v2.0/ports. * The number of ports is slightly under 1000. * After setting the timeout in engine to 90s, all the VMs started. Also, apparently this traceback in the provider is caused by the engine timing out and closing the socket before it is able to reply back with the answer: May 12 19:59:14 engine.example.com python2: ::ffff:192.168.0.1 - - [12/May/2020 19:59:14] "GET /v2.0/ports HTTP/1.1" 500 - May 12 19:59:14 engine.example.com python2: Traceback (most recent call last): May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 295, in _handle_request_noblock May 12 19:59:14 engine.example.com python2: self.process_request(request, client_address) May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 321, in process_request May 12 19:59:14 engine.example.com python2: self.finish_request(request, client_address) May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 334, in finish_request May 12 19:59:14 engine.example.com python2: self.RequestHandlerClass(request, client_address, self) May 12 19:59:14 engine.example.com python2: File "/usr/share/ovirt-provider-ovn/handlers/base_handler.py", line 96, in __init__ May 12 19:59:14 engine.example.com python2: self._run_server(request, client_address, server) May 12 19:59:14 engine.example.com python2: File "/usr/share/ovirt-provider-ovn/handlers/base_handler.py", line 99, in _run_server May 12 19:59:14 engine.example.com python2: BaseHTTPRequestHandler.__init__(self, request, client_address, server) May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 651, in __init__ May 12 19:59:14 engine.example.com python2: self.finish() May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/SocketServer.py", line 710, in finish May 12 19:59:14 engine.example.com python2: self.wfile.close() May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/socket.py", line 279, in close May 12 19:59:14 engine.example.com python2: self.flush() May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/socket.py", line 303, in flush May 12 19:59:14 engine.example.com python2: self._sock.sendall(view[write_offset:write_offset+buffer_size]) May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/ssl.py", line 744, in sendall May 12 19:59:14 engine.example.com python2: v = self.send(data[count:]) May 12 19:59:14 engine.example.com python2: File "/usr/lib64/python2.7/ssl.py", line 710, in send May 12 19:59:14 engine.example.com python2: v = self._sslobj.write(data) May 12 19:59:14 engine.example.com python2: error: [Errno 32] Broken pipe Version-Release number of selected component (if applicable): rhvm-4.3.7.2-0.1.el7.noarch ovirt-provider-ovn-1.2.27-1.el7ev.noarch ovn2.11-2.11.0-16.el7fdp.x86_64 rhv-openvswitch-ovn-common-2.11-5.el7ev.noarch How reproducible: Unable due to scale Actual results: OVN provider taking longer times to return responses
If I understand the code correctly, a GET on /ports will make the provider will: For each port in "ovn-nbctl lsp-list": get_dhcp get_lrp list_ls I couldn't find much debug to be turned on in the provider relating to the ovn_connection execute method to get a better idea on what exactly is slow. I'm not sure if its the provider or the NB. Also have been unsucessful in manually adding ports to attempt to reproduce without a large environment, OVN does not pick them up. Do we know with how many ports we test this?
verified in build : 4.4.3.1-0.7
verified on ovn :ovirt-provider-ovn-1.2.31-1.el8ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.3]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5212