1527155 – Describe jsonrpc client reconnection parameters

Bug 1527155 - Describe jsonrpc client reconnection parameters

Summary: Describe jsonrpc client reconnection parameters

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	vdsm
Classification:	oVirt
Component:	Bindings-API
Sub Component:
Version:	4.20.9.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	ovirt-4.2.1
Target Release:	4.20.11
Assignee:	Irit Goihman
QA Contact:	Pavol Brilla
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-12-18 16:44 UTC by Martin Sivák
Modified:	2018-02-12 11:54 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-02-12 11:54:43 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.2+ lsvaty: testing_ack+

Attachments	(Terms of Use)
The full log of the test (1.71 MB, text/plain) 2017-12-18 16:44 UTC, Martin Sivák	no flags	Details
The test script I used (481 bytes, text/x-python) 2017-12-18 16:45 UTC, Martin Sivák	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1525453	0	unspecified	CLOSED	jsonrpc reconnect logic contains busy loop and floods logs	2021-02-22 00:41:40 UTC
oVirt gerrit	85585	0	master	MERGED	client: document reconnect options	2017-12-20 17:19:57 UTC

Internal Links: 1525453

Description Martin Sivák 2017-12-18 16:44:54 UTC

Created attachment 1369596 [details]
The full log of the test

Description of problem:

- call ping2 in a loop
- reset vdsm
- observe the flurry of messages about timeouts and reconnects
- vdsm restart done
- but ping2 is still stuck

- call ping2 from separate client instance and observe it works

Version-Release number of selected component (if applicable):

vdsm-4.20.9.3-1.el7.centos

How reproducible:

Always

Steps to Reproduce:
1.
2.
3.

Actual results:

ping2 is stuck

Expected results:

ping2 returns after reconnect logic does its job

Comment 1 Martin Sivák 2017-12-18 16:45:34 UTC

Created attachment 1369597 [details]
The test script I used

Comment 2 Simone Tiraboschi 2017-12-18 16:54:25 UTC

The same also bypassing ovirt_hosted_engine_ha.lib and directly creating the connection from vdsm.client
as for:

import logging
import sys
import time

from vdsm import client

root = logging.getLogger()
root.setLevel(logging.DEBUG)
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
ch.setFormatter(formatter)
root.addHandler(ch)

cli = client.connect('localhost')
while 1:
    pong = cli.Host.ping2()
    print(str(pong))
    time.sleep(1)
cli.close()

Comment 3 Piotr Kliczewski 2017-12-18 16:59:38 UTC

This BZ is duplicate of BZ #1525453. Irit fix for the issue was merged today.

Comment 4 Martin Sivák 2017-12-18 17:06:57 UTC

Are you sure it is a duplicate? I am not complaining about log flood, but about the fact that it gets stuck at the end.

I can try hotfixing vdsm and testing with that though.

Comment 5 Martin Sivák 2017-12-18 17:14:10 UTC

So, it gets stuck as well:

2017-12-18 18:09:58,662 - jsonrpc.AsyncoreClient - DEBUG - Sending response
True
2017-12-18 18:09:59,667 - jsonrpc.AsyncoreClient - DEBUG - Sending response
True
2017-12-18 18:10:00,672 - jsonrpc.AsyncoreClient - DEBUG - Sending response
2017-12-18 18:10:02,676 - yajsonrpc.protocols.stomp.AsyncClient - DEBUG - Timeout occurred, trying to reconnect
2017-12-18 18:10:02,678 - stomp.AsyncDispatcher - DEBUG - Communication error occurred.


If I use ping2(_timeout=5) then I get a timeout:

2017-12-18 18:11:14,431 - jsonrpc.AsyncoreClient - DEBUG - Sending response
True
2017-12-18 18:11:15,436 - jsonrpc.AsyncoreClient - DEBUG - Sending response
2017-12-18 18:11:17,438 - yajsonrpc.protocols.stomp.AsyncClient - DEBUG - Timeout occurred, trying to reconnect
2017-12-18 18:11:17,440 - stomp.AsyncDispatcher - DEBUG - Communication error occurred.
Traceback (most recent call last):
  File "test.py", line 20, in <module>
    pong = cli.Host.ping2(_timeout = 5)
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 267, in _call
    raise TimeoutError(method, kwargs, timeout)
vdsm.client.TimeoutError: Request Host.ping2 with args {} timed out after 5 seconds

I would expect the command to succeed instead.

Comment 6 Sandro Bonazzola 2017-12-18 17:21:51 UTC

proposing as blocker for 4.2.0. please ack / nack / retarget if this shouldn't block 4.2.0

Comment 7 Oved Ourfali 2017-12-19 06:48:18 UTC

Irit - should we indeed close this as duplicate?

Comment 8 Irit Goihman 2017-12-19 07:18:25 UTC

It looks like a different issue - I'm checking it.

Comment 9 Irit Goihman 2017-12-19 09:30:30 UTC

There are two issues with your test:
1. There is only 1 reconnect attempt in default mode.
2. Command timeout is 5 seconds and it probably takes a few more seconds to establish a successful connection.

I saw reconnect logic works when I changed nr_retries to 2 (client.connect('localhost', nr_retries=2)) and timeout to 20s.

If we reach a certain point in the command life cycle where it was already sent to the server and we lost connection, we will get a TimeoutError eventually since it's considered lost from our side.

This is my code:

import logging
import sys
import time

from vdsm import client

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

cli = client.connect('localhost', nr_retries=2)
while 1:
    try:
        pong = cli.Host.ping2(_timeout=20)
        print(str(pong))
        time.sleep(1)
    except client.TimeoutError:
        print("timeout error, trying to send commands again")
cli.close()


output:

DEBUG:jsonrpc.AsyncoreClient:Sending response
True
DEBUG:jsonrpc.AsyncoreClient:Sending response
True
DEBUG:jsonrpc.AsyncoreClient:Sending response
DEBUG:stomp.AsyncDispatcher:closing connection
DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Timeout occurred, trying to reconnect
DEBUG:stomp.AsyncDispatcher:Communication error occurred.
DEBUG:stomp.AsyncDispatcher:closing connection
DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Timeout occurred, trying to reconnect
DEBUG:stomp.AsyncDispatcher:managed to connect successfully.
DEBUG:stomp.AsyncDispatcher:Communication error occurred.
DEBUG:stomp.AsyncDispatcher:Communication error occurred.
DEBUG:stomp.AsyncDispatcher:closing connection
DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Timeout occurred, trying to reconnect
DEBUG:stomp.AsyncDispatcher:managed to connect successfully.
DEBUG:yajsonrpc.protocols.stomp.AsyncClient:Stomp connection established
timeout error, trying to send commands again
DEBUG:jsonrpc.AsyncoreClient:Sending response
True
DEBUG:jsonrpc.AsyncoreClient:Sending response
True
DEBUG:jsonrpc.AsyncoreClient:Sending response

Comment 10 Irit Goihman 2017-12-19 11:54:07 UTC

This is not a bug, reconnect logic works.
Martin, if you don't have any other requests regarding this let's close this bug.

Comment 11 Martin Sivák 2017-12-19 13:02:05 UTC

What happens when no timeout is provided? Is there a default value? I was waiting for couple of minutes. 

Hosted engine uses timeout so this is probably not a blocker, but I still believe it might be a bug (or something missing in documentation).

Comment 12 Irit Goihman 2017-12-19 14:23:30 UTC

AFAIK HE is using 900s as a default value, please set it to a different value if you think this is too high for the current usage.

If you wish to cancel reconnect logic, please create the client with  client.connect('localhost', nr_retries=0)

If not specified, number of reconnect attempts will be 1 (which has no real affect on performance since it's only one short attempt).

Comment 13 Martin Sivák 2017-12-19 14:32:25 UTC

Hosted engine is using 15 minutes indeed, but Simone saw the same with plain client.

Anyway, this whole logic should probably be documented somewhere and we can close it if it is. The meaning of _timeout is pretty confusing when combined with reconnect.

Comment 14 Irit Goihman 2017-12-19 14:37:13 UTC

I've dded more information about number of retries to client documentation.
There is more information about timeouts in the same section.

Comment 15 Martin Perina 2017-12-19 18:59:01 UTC

Moving to 4.2.1 as this is documentation change only and does not contain any code changes. So removing any bugs dependencies

Comment 16 Pavol Brilla 2018-01-17 12:00:31 UTC

# grep -A5 -B7 "cli = client.connect('localhost', 54321, use_tls=True, nr_retries=0)" /usr/lib/python2.7/site-packages/vdsm/client.py

    cli = client.connect('localhost', 54321, use_tls=True, timeout=180)

The client supports reconnecting in case VDSM connection is lost.
Default number of attempts to reconnect is 1. In order to cancel the reconnect
mechanism, please change nr_retries to 0:

    cli = client.connect('localhost', 54321, use_tls=True, nr_retries=0)

In order to support a higher number of attempts, please pass number of retries
when creating the client::

    cli = client.connect('localhost', 54321, use_tls=True, nr_retries=10)

Comment 17 Sandro Bonazzola 2018-02-12 11:54:43 UTC

This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Note You need to log in before you can comment on or make changes to this bug.