Bug 1377216

Summary: jsonrpcvdscli: AttributeError exception reporting issues with Host.getHardwareInfo, JsonRpcNoResponseError at application eyes
Product: [oVirt] vdsm Reporter: Simone Tiraboschi <stirabos>
Component: Bindings-APIAssignee: Irit Goihman <igoihman>
Status: CLOSED CURRENTRELEASE QA Contact: Jiri Belka <jbelka>
Severity: high Docs Contact:
Priority: high    
Version: 4.18.11CC: aperotti, bugs, fromani, jbelka, mgoldboi, mperina, oourfali, stirabos, ybronhei
Target Milestone: ovirt-4.0.5Flags: rule-engine: ovirt-4.0.z+
mgoldboi: planning_ack+
mperina: devel_ack+
pstehlik: testing_ack+
Target Release: 4.18.12   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-18 07:40:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
vdsm.log none

Description Simone Tiraboschi 2016-09-19 08:45:58 UTC
Description of problem:

If, for any reason, we have an issue gathering Host.getHardwareInfo info, the json rpc client hits an AttributeError exception:

Reactor thread::DEBUG::2016-09-08 15:44:40,357::stompreactor::492::protocoldetector.StompDetector::(handle_socket) Stomp detected from ('::ffff:192.168.105.65', 59983)
jsonrpc.Executor/0::INFO::2016-09-08 15:44:41,742::__init__::524::jsonrpc.JsonRpcServer::(_handle_request) In recovery, ignoring 'Host.getHardwareInfo' in bridge with {}
jsonrpc.Executor/0::ERROR::2016-09-08 15:44:41,742::executor::193::Executor::(_execute_task) Unhandled exception in Task(callable=<functools.partial object at 0x25c5788>, timeout=None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 191, in _execute_task
    task.callable()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 510, in _serveRequest
    response_log = "failed (error %s)" % (error.code,)
AttributeError: 'dict' object has no attribute 'code'
storageRefresh::DEBUG::2016-09-08 15:44:43,897::multipath::78::Storage.Misc.excCmd::(rescan) SUCCESS: <err> = ''; <rc> = 0

And the application (hosted-engine-setup in this case) gets no response at all:
2016-09-08 16:26:08 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/system/vdsmenv.py", line 157, in _misc
    timeout=ohostedcons.Const.VDSCLI_SSL_TIMEOUT,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 207, in connect_vdsm_json_rpc
    hwinfo = cli.getVdsHardwareInfo()
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 153, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC Host.getHardwareInfo request.


Version-Release number of selected component (if applicable):


How reproducible:
?

Steps to Reproduce:
1.
2.
3.

Actual results:
AttributeError: 'dict' object has no attribute 'code'
in VDSM logs, JsonRpcNoResponseError in the application (after a long timeout, 900 seconds).

Expected results:
The Host.getHardwareInfo fails with a clear error.

Additional info:

Comment 2 Piotr Kliczewski 2016-09-19 08:59:06 UTC
It looks like this issue is caused by:

jsonrpc.Executor/0::INFO::2016-09-08 14:42:34,291::__init__::524::jsonrpc.JsonRpcServer::(_handle_request) In recovery, ignoring 'Host.getHardwareInfo' in bridge with {}

Comment 7 Yaniv Bronhaim 2016-09-27 08:20:11 UTC
From the above - The print "In recovery, ignoring '%s' in bridge with %s" appears if vdsmd initialization phase was not finished before the request arrived. This can happens if host runs large amount of vms which sometimes leads to long initialization period, or something that takes time during the storage initialization. It has nothing to do with the request itself, nor the getHardwareInfo API.

If you can reproduce the exact load that caused vdsm this long period we'll be able to investigate it better, but it also can be connectivity issues to the sd.

Please update the bug if you have more info, if it was resolved by changing configuration (referring to comment #5) please close or recompose the bug description to be about the LUN cache policy.

In addition, I'm not sure why vdsmd was restarting in the first place.. this also might be an issue. I can't access the ticket - if you can please attach vdsm.log here it would be better.

Comment 8 Oved Ourfali 2016-09-27 08:24:13 UTC
Created attachment 1205104 [details]
vdsm.log

Comment 9 Simone Tiraboschi 2016-09-27 08:43:23 UTC
(In reply to Yaniv Bronhaim from comment #7)
> From the above - The print "In recovery, ignoring '%s' in bridge with %s"
> appears if vdsmd initialization phase was not finished before the request

Let me try to explain the issue.
The context: this happened in hosted-engine-setup. In hosted-engine-setup VDSM is firstly initialized with a temporary cert just to be able to connect to it to setup up the storage and create the first VM for the engine.
Once we have an engine, hosted-engine-setup will call hosts.add to add the host and this will trigger host-deploy to redeploy the host and this means that vdsm certs will be replaced by the ones signed by the engine and vdsm will restart as for normal host-deploy flow.

In the mean time, hosted-engine-setup is polling VDSM to when we can reconnect.
The polling cycle is here:
https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/lib/util.py;h=512f4670a98650c542500ec7d1ebc675985bdf06;hb=HEAD#l204
It calls getVdsHardwareInfo till it gets a response with code=0 to be sure that VDSM is ready.

Now the issue:
since getVdsHardwareInfo got skipped due to the recovering status, jsonrpcvdscli should return something with exit code!=0.
But there is a bug in jsonrpcvdscli.py (AttributeError: 'dict' object has no attribute 'code') and so nothing reaches back the application (hosted-engine-setup) that waits till a timeout on JsonRpcClient and then aborts.

Being a timing issue it's not simply reproducible but the issue is there (
jsonrpc.Executor/0::ERROR::2016-09-08 15:44:41,742::executor::193::Executor::(_execute_task) Unhandled exception in Task(callable=<functools.partial object at 0x25c5788>, timeout=None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 191, in _execute_task
    task.callable()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 510, in _serveRequest
    response_log = "failed (error %s)" % (error.code,)
AttributeError: 'dict' object has no attribute 'code'
).

In this case there is just a VM (the one with the engine) and so the increased startup delay was probably due to the amount of connected LUN on the host.

Comment 10 Simone Tiraboschi 2016-09-27 09:03:49 UTC
https://gerrit.ovirt.org/#/c/61613/ should probably fix it.
Not sure why we saw it again in 4.8.11-1.

Comment 11 Irit Goihman 2016-09-27 09:10:46 UTC
https://gerrit.ovirt.org/#/c/61613/ solves the issue. It was fixed in 4.18.12.
Please upgrade vdsm version to 4.18.12 or higher.

Comment 12 Francesco Romani 2016-09-27 09:12:52 UTC
(In reply to Irit Goihman from comment #11)
> https://gerrit.ovirt.org/#/c/61613/ solves the issue. It was fixed in
> 4.18.12.
> Please upgrade vdsm version to 4.18.12 or higher.

Correct. For reference, Vdsm 4.8.11 is just 4.8.10 + only the fix for https://bugzilla.redhat.com/1364149

Comment 13 Oved Ourfali 2016-09-27 10:23:35 UTC
Moving to ON_QA.

Comment 14 Jiri Belka 2016-10-11 15:37:42 UTC
What are verification steps for QA? Thx.

Comment 15 Simone Tiraboschi 2016-10-24 09:22:47 UTC
(In reply to Jiri Belka from comment #14)
> What are verification steps for QA? Thx.

Restart vdsmd via systemctl.
While vdsm is restarting (the exact timing depends on different factors but try to be as quick as possible), run:
python -c 'from ovirt_hosted_engine_ha.lib import util as ohautil; cli = ohautil.connect_vdsm_json_rpc(logger=None, timeout=60); print(cli.getVdsHardwareInfo())'

You shouldn't see any error.

Comment 16 Jiri Belka 2016-10-25 13:39:31 UTC
ok, vdsm-4.18.15.2-1.el7ev.x86_64

can't get problematic exception

# echo 'systemctl restart vdsmd' | at now ; python -c 'from ovirt_hosted_engine_ha.lib import util as ohautil; cli = ohautil.connect_vdsm_json_rpc(logger=None, timeout=60); print(cli.getVdsHardwareInfo())'
job 12 at Tue Oct 25 15:38:00 2016
{u'systemProductName': u'PowerEdge R210 II', u'systemUUID': u'4C4C4544-0058-3410-8058-C2C04F38354A', u'systemSerialNumber': u'BX4X85J', 'status': {'message': 'Done', 'code': 0}, u'systemManufacturer': u'Dell Inc.'}