Created attachment 1369596 [details] The full log of the test Description of problem: - call ping2 in a loop - reset vdsm - observe the flurry of messages about timeouts and reconnects - vdsm restart done - but ping2 is still stuck - call ping2 from separate client instance and observe it works Version-Release number of selected component (if applicable): vdsm-4.20.9.3-1.el7.centos How reproducible: Always Steps to Reproduce: 1. 2. 3. Actual results: ping2 is stuck Expected results: ping2 returns after reconnect logic does its job
Created attachment 1369597 [details] The test script I used
The same also bypassing ovirt_hosted_engine_ha.lib and directly creating the connection from vdsm.client as for: import logging import sys import time from vdsm import client root = logging.getLogger() root.setLevel(logging.DEBUG) ch = logging.StreamHandler(sys.stdout) ch.setLevel(logging.DEBUG) formatter = logging.Formatter( '%(asctime)s - %(name)s - %(levelname)s - %(message)s' ) ch.setFormatter(formatter) root.addHandler(ch) cli = client.connect('localhost') while 1: pong = cli.Host.ping2() print(str(pong)) time.sleep(1) cli.close()
This BZ is duplicate of BZ #1525453. Irit fix for the issue was merged today.
Are you sure it is a duplicate? I am not complaining about log flood, but about the fact that it gets stuck at the end. I can try hotfixing vdsm and testing with that though.
So, it gets stuck as well: 2017-12-18 18:09:58,662 - jsonrpc.AsyncoreClient - DEBUG - Sending response True 2017-12-18 18:09:59,667 - jsonrpc.AsyncoreClient - DEBUG - Sending response True 2017-12-18 18:10:00,672 - jsonrpc.AsyncoreClient - DEBUG - Sending response 2017-12-18 18:10:02,676 - yajsonrpc.protocols.stomp.AsyncClient - DEBUG - Timeout occurred, trying to reconnect 2017-12-18 18:10:02,678 - stomp.AsyncDispatcher - DEBUG - Communication error occurred. If I use ping2(_timeout=5) then I get a timeout: 2017-12-18 18:11:14,431 - jsonrpc.AsyncoreClient - DEBUG - Sending response True 2017-12-18 18:11:15,436 - jsonrpc.AsyncoreClient - DEBUG - Sending response 2017-12-18 18:11:17,438 - yajsonrpc.protocols.stomp.AsyncClient - DEBUG - Timeout occurred, trying to reconnect 2017-12-18 18:11:17,440 - stomp.AsyncDispatcher - DEBUG - Communication error occurred. Traceback (most recent call last): File "test.py", line 20, in <module> pong = cli.Host.ping2(_timeout = 5) File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 267, in _call raise TimeoutError(method, kwargs, timeout) vdsm.client.TimeoutError: Request Host.ping2 with args {} timed out after 5 seconds I would expect the command to succeed instead.
proposing as blocker for 4.2.0. please ack / nack / retarget if this shouldn't block 4.2.0
Irit - should we indeed close this as duplicate?
It looks like a different issue - I'm checking it.
There are two issues with your test: 1. There is only 1 reconnect attempt in default mode. 2. Command timeout is 5 seconds and it probably takes a few more seconds to establish a successful connection. I saw reconnect logic works when I changed nr_retries to 2 (client.connect('localhost', nr_retries=2)) and timeout to 20s. If we reach a certain point in the command life cycle where it was already sent to the server and we lost connection, we will get a TimeoutError eventually since it's considered lost from our side. This is my code: import logging import sys import time from vdsm import client logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) cli = client.connect('localhost', nr_retries=2) while 1: try: pong = cli.Host.ping2(_timeout=20) print(str(pong)) time.sleep(1) except client.TimeoutError: print("timeout error, trying to send commands again") cli.close() output: DEBUG:jsonrpc.AsyncoreClient:Sending response True DEBUG:jsonrpc.AsyncoreClient:Sending response True DEBUG:jsonrpc.AsyncoreClient:Sending response DEBUG:stomp.AsyncDispatcher:closing connection DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Timeout occurred, trying to reconnect DEBUG:stomp.AsyncDispatcher:Communication error occurred. DEBUG:stomp.AsyncDispatcher:closing connection DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Timeout occurred, trying to reconnect DEBUG:stomp.AsyncDispatcher:managed to connect successfully. DEBUG:stomp.AsyncDispatcher:Communication error occurred. DEBUG:stomp.AsyncDispatcher:Communication error occurred. DEBUG:stomp.AsyncDispatcher:closing connection DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Timeout occurred, trying to reconnect DEBUG:stomp.AsyncDispatcher:managed to connect successfully. DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Stomp connection established timeout error, trying to send commands again DEBUG:jsonrpc.AsyncoreClient:Sending response True DEBUG:jsonrpc.AsyncoreClient:Sending response True DEBUG:jsonrpc.AsyncoreClient:Sending response
This is not a bug, reconnect logic works. Martin, if you don't have any other requests regarding this let's close this bug.
What happens when no timeout is provided? Is there a default value? I was waiting for couple of minutes. Hosted engine uses timeout so this is probably not a blocker, but I still believe it might be a bug (or something missing in documentation).
AFAIK HE is using 900s as a default value, please set it to a different value if you think this is too high for the current usage. If you wish to cancel reconnect logic, please create the client with client.connect('localhost', nr_retries=0) If not specified, number of reconnect attempts will be 1 (which has no real affect on performance since it's only one short attempt).
Hosted engine is using 15 minutes indeed, but Simone saw the same with plain client. Anyway, this whole logic should probably be documented somewhere and we can close it if it is. The meaning of _timeout is pretty confusing when combined with reconnect.
I've dded more information about number of retries to client documentation. There is more information about timeouts in the same section.
Moving to 4.2.1 as this is documentation change only and does not contain any code changes. So removing any bugs dependencies
# grep -A5 -B7 "cli = client.connect('localhost', 54321, use_tls=True, nr_retries=0)" /usr/lib/python2.7/site-packages/vdsm/client.py cli = client.connect('localhost', 54321, use_tls=True, timeout=180) The client supports reconnecting in case VDSM connection is lost. Default number of attempts to reconnect is 1. In order to cancel the reconnect mechanism, please change nr_retries to 0: cli = client.connect('localhost', 54321, use_tls=True, nr_retries=0) In order to support a higher number of attempts, please pass number of retries when creating the client:: cli = client.connect('localhost', 54321, use_tls=True, nr_retries=10)
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.