| Summary: | jsonrpcvdscli: AttributeError exception reporting issues with Host.getHardwareInfo, JsonRpcNoResponseError at application eyes | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Simone Tiraboschi <stirabos> | ||||
| Component: | Bindings-API | Assignee: | Irit Goihman <igoihman> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jiri Belka <jbelka> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 4.18.11 | CC: | aperotti, bugs, fromani, jbelka, mgoldboi, mperina, oourfali, stirabos, ybronhei | ||||
| Target Milestone: | ovirt-4.0.5 | Flags: | rule-engine:
ovirt-4.0.z+
mgoldboi: planning_ack+ mperina: devel_ack+ pstehlik: testing_ack+ |
||||
| Target Release: | 4.18.12 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2017-01-18 07:40:06 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
It looks like this issue is caused by:
jsonrpc.Executor/0::INFO::2016-09-08 14:42:34,291::__init__::524::jsonrpc.JsonRpcServer::(_handle_request) In recovery, ignoring 'Host.getHardwareInfo' in bridge with {}
From the above - The print "In recovery, ignoring '%s' in bridge with %s" appears if vdsmd initialization phase was not finished before the request arrived. This can happens if host runs large amount of vms which sometimes leads to long initialization period, or something that takes time during the storage initialization. It has nothing to do with the request itself, nor the getHardwareInfo API. If you can reproduce the exact load that caused vdsm this long period we'll be able to investigate it better, but it also can be connectivity issues to the sd. Please update the bug if you have more info, if it was resolved by changing configuration (referring to comment #5) please close or recompose the bug description to be about the LUN cache policy. In addition, I'm not sure why vdsmd was restarting in the first place.. this also might be an issue. I can't access the ticket - if you can please attach vdsm.log here it would be better. Created attachment 1205104 [details]
vdsm.log
(In reply to Yaniv Bronhaim from comment #7) > From the above - The print "In recovery, ignoring '%s' in bridge with %s" > appears if vdsmd initialization phase was not finished before the request Let me try to explain the issue. The context: this happened in hosted-engine-setup. In hosted-engine-setup VDSM is firstly initialized with a temporary cert just to be able to connect to it to setup up the storage and create the first VM for the engine. Once we have an engine, hosted-engine-setup will call hosts.add to add the host and this will trigger host-deploy to redeploy the host and this means that vdsm certs will be replaced by the ones signed by the engine and vdsm will restart as for normal host-deploy flow. In the mean time, hosted-engine-setup is polling VDSM to when we can reconnect. The polling cycle is here: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/lib/util.py;h=512f4670a98650c542500ec7d1ebc675985bdf06;hb=HEAD#l204 It calls getVdsHardwareInfo till it gets a response with code=0 to be sure that VDSM is ready. Now the issue: since getVdsHardwareInfo got skipped due to the recovering status, jsonrpcvdscli should return something with exit code!=0. But there is a bug in jsonrpcvdscli.py (AttributeError: 'dict' object has no attribute 'code') and so nothing reaches back the application (hosted-engine-setup) that waits till a timeout on JsonRpcClient and then aborts. Being a timing issue it's not simply reproducible but the issue is there ( jsonrpc.Executor/0::ERROR::2016-09-08 15:44:41,742::executor::193::Executor::(_execute_task) Unhandled exception in Task(callable=<functools.partial object at 0x25c5788>, timeout=None) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 191, in _execute_task task.callable() File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 510, in _serveRequest response_log = "failed (error %s)" % (error.code,) AttributeError: 'dict' object has no attribute 'code' ). In this case there is just a VM (the one with the engine) and so the increased startup delay was probably due to the amount of connected LUN on the host. https://gerrit.ovirt.org/#/c/61613/ should probably fix it. Not sure why we saw it again in 4.8.11-1. https://gerrit.ovirt.org/#/c/61613/ solves the issue. It was fixed in 4.18.12. Please upgrade vdsm version to 4.18.12 or higher. (In reply to Irit Goihman from comment #11) > https://gerrit.ovirt.org/#/c/61613/ solves the issue. It was fixed in > 4.18.12. > Please upgrade vdsm version to 4.18.12 or higher. Correct. For reference, Vdsm 4.8.11 is just 4.8.10 + only the fix for https://bugzilla.redhat.com/1364149 Moving to ON_QA. What are verification steps for QA? Thx. (In reply to Jiri Belka from comment #14) > What are verification steps for QA? Thx. Restart vdsmd via systemctl. While vdsm is restarting (the exact timing depends on different factors but try to be as quick as possible), run: python -c 'from ovirt_hosted_engine_ha.lib import util as ohautil; cli = ohautil.connect_vdsm_json_rpc(logger=None, timeout=60); print(cli.getVdsHardwareInfo())' You shouldn't see any error. ok, vdsm-4.18.15.2-1.el7ev.x86_64
can't get problematic exception
# echo 'systemctl restart vdsmd' | at now ; python -c 'from ovirt_hosted_engine_ha.lib import util as ohautil; cli = ohautil.connect_vdsm_json_rpc(logger=None, timeout=60); print(cli.getVdsHardwareInfo())'
job 12 at Tue Oct 25 15:38:00 2016
{u'systemProductName': u'PowerEdge R210 II', u'systemUUID': u'4C4C4544-0058-3410-8058-C2C04F38354A', u'systemSerialNumber': u'BX4X85J', 'status': {'message': 'Done', 'code': 0}, u'systemManufacturer': u'Dell Inc.'}
|
Description of problem: If, for any reason, we have an issue gathering Host.getHardwareInfo info, the json rpc client hits an AttributeError exception: Reactor thread::DEBUG::2016-09-08 15:44:40,357::stompreactor::492::protocoldetector.StompDetector::(handle_socket) Stomp detected from ('::ffff:192.168.105.65', 59983) jsonrpc.Executor/0::INFO::2016-09-08 15:44:41,742::__init__::524::jsonrpc.JsonRpcServer::(_handle_request) In recovery, ignoring 'Host.getHardwareInfo' in bridge with {} jsonrpc.Executor/0::ERROR::2016-09-08 15:44:41,742::executor::193::Executor::(_execute_task) Unhandled exception in Task(callable=<functools.partial object at 0x25c5788>, timeout=None) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 191, in _execute_task task.callable() File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 510, in _serveRequest response_log = "failed (error %s)" % (error.code,) AttributeError: 'dict' object has no attribute 'code' storageRefresh::DEBUG::2016-09-08 15:44:43,897::multipath::78::Storage.Misc.excCmd::(rescan) SUCCESS: <err> = ''; <rc> = 0 And the application (hosted-engine-setup in this case) gets no response at all: 2016-09-08 16:26:08 DEBUG otopi.context context._executeMethod:142 method exception Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod method['method']() File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/system/vdsmenv.py", line 157, in _misc timeout=ohostedcons.Const.VDSCLI_SSL_TIMEOUT, File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 207, in connect_vdsm_json_rpc hwinfo = cli.getVdsHardwareInfo() File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 153, in _callMethod raise JsonRpcNoResponseError(method) JsonRpcNoResponseError: [-32605] No response for JSON-RPC Host.getHardwareInfo request. Version-Release number of selected component (if applicable): How reproducible: ? Steps to Reproduce: 1. 2. 3. Actual results: AttributeError: 'dict' object has no attribute 'code' in VDSM logs, JsonRpcNoResponseError in the application (after a long timeout, 900 seconds). Expected results: The Host.getHardwareInfo fails with a clear error. Additional info: