Bug 1525453

Summary: jsonrpc reconnect logic contains busy loop and floods logs
Product: [oVirt] vdsm Reporter: Martin Sivák <msivak>
Component: Bindings-APIAssignee: Irit Goihman <igoihman>
Status: CLOSED CURRENTRELEASE QA Contact: Polina <pagranat>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.20.9CC: bugs, khakimi, lsvaty, mperina
Target Milestone: ovirt-4.2.1Flags: rule-engine: ovirt-4.2+
Target Release: 4.20.11   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-02-22 10:01:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Martin Sivák 2017-12-13 11:08:40 UTC
Description of problem:

We saw this as part of https://bugzilla.redhat.com/show_bug.cgi?id=1524119

This happens for example on hosted engine environment when VDSM dies. The new jsonrpc logic tries to reconnect.. flooding the logs in the process:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mgmt_bridge.py", line 44, in action
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 442, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 398, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't  connect to VDSM within 15 seconds
Client localhost:54321::WARNING::2017-12-12 13:46:53,507::betterAsyncore::177::vds.dispatcher::(log_info) unhandled close event
Client localhost:54321::WARNING::2017-12-12 13:46:53,508::betterAsyncore::177::vds.dispatcher::(log_info) unhandled close event
Client localhost:54321::WARNING::2017-12-12 13:46:53,509::betterAsyncore::177::vds.dispatcher::(log_info) unhandled close event
Client localhost:54321::WARNING::2017-12-12 13:46:53,511::betterAsyncore::177::vds.dispatcher::(log_info) unhandled close event
Client localhost:54321::WARNING::2017-12-12 13:46:53,512::betterAsyncore::177::vds.dispatcher::(log_info) unhandled close event

Notice the timestamp increments are close to 1 ms (that is millisecond!) per line.

Btw: we were looking for a way to control reconnect attempt frequency and reconnect attempt count limit, but found nothing. It would be wise to provide such tunables to avoid proper error handling when vdsm is really and truly gone.

Version-Release number of selected component (if applicable):

oVirt 4.2.0 RC1
vdsm 4.20.9

How reproducible:

- Always, start hosted engine and then kill VDSM
- watch the hosted engine's broker.log

Additional info:

We saw something similar yesterday here https://bugzilla.redhat.com/show_bug.cgi?id=1524119#c7 (which was triggered by https://bugzilla.redhat.com/show_bug.cgi?id=1522878)

14:14:09,048::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,049::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect
Client localhost:54321::DEBUG::2017-12-12 14:14:09,051::stomp::617::yajsonrpc.protocols.stomp.AsyncClient::(handle_timeout) Timeout occurred, trying to reconnect

Comment 1 Irit Goihman 2017-12-28 08:50:53 UTC
*** Bug 1529458 has been marked as a duplicate of this bug. ***

Comment 2 Doron Fediuck 2018-01-01 08:42:54 UTC
*** Bug 1529465 has been marked as a duplicate of this bug. ***

Comment 3 Polina 2018-02-12 14:33:54 UTC
Verified in 
    ovirt-hosted-engine-setup-2.2.6-1.el7ev.noarch.

In /var/log/ovirt-hosted-engine-ha/broker.log
reported
RuntimeError: Couldn't  connect to VDSM within 60 seconds.
no jsonrpc reportings

Comment 4 Sandro Bonazzola 2018-02-22 10:01:46 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.