Bug 1569593
Summary: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory' Is the Hosted Engine setup finished?

Product: [oVirt] ovirt-hosted-engine-ha | Reporter: Petr Kubica <pkubica>
Component: General | Assignee: Milan Zamazal <mzamazal>
Status: CLOSED CURRENTRELEASE | QA Contact: Nikolai Sednev <nsednev>
Severity: low | Docs Contact:
Priority: medium
Version: --- | CC: akrejcir, alexander, bugs, danken, dfediuck, dholler, emarcus, fromani, mburman, mgoldboi, michal.skrivanek, mike, obockows, pkubica, rbarry, robnunin, srosenbe, stirabos
Target Milestone: ovirt-4.4.0 | Keywords: Automation, AutomationBlocker, EasyFix, Triaged
Target Release: --- | Flags: rbarry: ovirt-4.4?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: | Doc Type: Bug Fix
Doc Text: When deploying the self-hosted engine on a host, the broker and agent services are brought down momentarily. If the VDSM service attempted to send a get_stats message before the services were restarted, the communication failed and VDSM logged an error message. In this release, such events result in a warning and are no longer flagged or logged as errors.
Story Points: ---
Clone Of:
Clones: 1656294 (view as bug list) | Environment:
Last Closed: 2020-05-20 19:59:57 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: SLA | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Bug Depends On: 1795672
Bug Blocks: 1656294
Attachments:
Description
Petr Kubica
2018-04-19 14:39:50 UTC
Is hosted engine installed on the vdsm host with this issue?

That is a difficult question. During the tests it can happen that the machine was removed and re-added as a host without the hosted-engine deployment process (I mean that it was added as a regular host, not as a hosted-engine host). I ran all the tests again and checked webadmin and all hosts. In the engine there are crowns on all hosts, but at first lynx23 and lynx24 showed exclamation marks for a few seconds with "unable to compute HA score". I'm attaching the output of hosted-engine --vm-status from all hosts.

Created attachment 1425253 [details]
hosted-engine --vm-status
We really need a reproducer if we want to fix this, but looking at the report and comment #3 I am not really sure I would consider this a bug. The user shouldn't see multiple printed exceptions in the journal log; they should see only errors in this case.

I cannot give you exact reproduction steps for how to achieve this exception. I only have the automation logs from all machines, which I posted in comment #1.

I was asked by a customer to double-check his upgraded RHV environment (4.1 -> 4.2). There are some other issues, so I had to conduct a similar upgrade in my test environment. After upgrading the hypervisor from rhvh-4.1-0.20180126.0 to rhvh-4.2.3.0-0.20180518.0 I have problems with ovirt-ha-agent, which leads to ovirt-ha-broker and finally to vdsm; many errors are reported:

ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:196)

*** Bug 1569601 has been marked as a duplicate of this bug. ***

The error:

ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA score ...

is logged when the ovirt-ha-broker service is not running, or is not ready to respond to requests. We could limit the frequency of these messages. The traceback beginning with 'WARN Worker blocked' is interesting. Is it somewhere in the provided logs? I didn't find it.

For example in vdsm.log of the lynx23 server in the archive, on line 9924.

Petr, why is this an automation blocker? It just pollutes the log, no functionality is broken.

It blocks the test that searches for tracebacks in the journal log of the vdsmd service (as there shouldn't be any tracebacks). Due to this bug we cannot find other exceptions which could be more important.

Francesco, you mentioned that GSS asked for this traceback to stay visible, IIRC. Can we figure out how to make it work for us, QA and GSS? One option might be showing a WARNING summary and a DEBUG traceback. The traceback was implemented in Bug 1501012.

(In reply to Martin Sivák from comment #13)
> Francesco, you mentioned that GSS asked for this traceback to stay visible
> iirc. Can we figure out how to make it work for us, QA and GSS?
>
> One option might be showing WARNING summary and DEBUG traceback.

We can work on that. I myself have no preference; we "just" need consensus about a solution.

I see this error during addition of an additional ha-host to the engine:

2018-12-05 13:05:14,974+0200 ERROR (periodic/0) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:196)
2018-12-05 13:05:15,428+0200 INFO (hsm/init) [storage.HSM] FINISH HSM init succeeded in 1.37 seconds (hsm:410)
2018-12-05 13:05:15,718+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,725+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,729+0200 INFO (jsonrpc/4) [vdsm.api] START getStorageDomainInfo(sdUUID=u'35c9172d-c2ca-407f-b54a-67a4fb24b757', options=None) from=::1,60576, task_id=4af8dbf5-a1f3-4522-a803-2ca16b51dc88 (api:46)
2018-12-05 13:05:15,834+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,894+0200 INFO (itmap/0) [IOProcessClient] (/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_nsednev__he__1) Starting client (__init__:308)
2018-12-05 13:05:15,904+0200 INFO (ioprocess/57753) [IOProcess] (/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_nsednev__he__1) Starting ioprocess (__init__:434)
2018-12-05 13:05:15,909+0200 INFO (jsonrpc/4) [IOProcessClient] (35c9172d-c2ca-407f-b54a-67a4fb24b757) Starting client (__init__:308)
2018-12-05 13:05:15,920+0200 INFO (ioprocess/57760) [IOProcess] (35c9172d-c2ca-407f-b54a-67a4fb24b757) Starting ioprocess (__init__:434)
2018-12-05 13:05:15,926+0200 INFO (jsonrpc/4) [storage.StorageDomain] Removing remnants of deleted images [] (fileSD:734)
2018-12-05 13:05:15,926+0200 INFO (jsonrpc/4) [storage.StorageDomain] sdUUID=35c9172d-c2ca-407f-b54a-67a4fb24b757 (fileSD:534)
2018-12-05 13:05:15,934+0200 INFO (jsonrpc/4) [vdsm.api] FINISH getStorageDomainInfo return={'info': {'uuid': u'35c9172d-c2ca-407f-b54a-67a4fb24b757', 'version': '4', 'role': 'Master', 'remotePath': 'yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/nsednev_he_1', 'type': 'NFS', 'class': 'Data', 'pool': ['890ecd6b-6da7-4b1c-97d5-760239680b36'], 'name': 'hosted_storage'}} from=::1,60576, task_id=4af8dbf5-a1f3-4522-a803-2ca16b51dc88 (api:52)
2018-12-05 13:05:15,935+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call StorageDomain.getInfo succeeded in 0.20 seconds (__init__:573)
2018-12-05 13:05:15,943+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,947+0200 INFO (jsonrpc/7) [vdsm.api] START getStorageDomainInfo(sdUUID=u'35c9172d-c2ca-407f-b54a-

Steps for reproduction:
1. Deploy HE over NFS on one ha-host.
2. Add an additional regular host, not an ha-host.
3. Set the additional regular host into maintenance.
4. Reinstall the additional non-ha-host as a hosted-engine host.
5. Check vdsm.log on the host reinstalled as an ha-host.

That part actually doesn't look bad to me, Nikolai -- no traceback which will confuse automation.

Dominik - what's the timeout for blocking workers? It seems to me that simply setting an appropriate timeout in ovirt_hosted_engine_ha (which doesn't have one right now, or any exception handling at all that I can see) would stop the blocking.

Milan -- I know you've never touched this, but it's Python ;) Likely candidates seem to be a lack of any connection timeouts in ovirt_hosted_engine_ha/lib/unixrpc.py, and no exception handling at all in ovirt_hosted_engine_ha/lib/brokerlink.py#L138. No need for an actual fix to this, since Martin indicated that the messages are harmless. Just trying to avoid blocking vdsm workers by throwing exceptions around.
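For illustration only, a minimal sketch of the kind of socket timeout and exception handling suggested in the comment above. This is not the actual ovirt_hosted_engine_ha code: the function name, the request payload, and the socket path are assumptions, and the real unixrpc.py/brokerlink.py are structured differently.

```python
# Hypothetical sketch -- not the real ovirt_hosted_engine_ha code.
import socket

# Assumed path; the comments below only say the broker uses a "broker.socket" file.
BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"

def get_stats_with_timeout(timeout=20.0):
    """Ask the HA broker for stats without blocking a vdsm worker forever.

    Returns the raw response string, or None if the broker is not (yet)
    reachable -- e.g. its socket file does not exist while the services
    are being restarted during deployment.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)          # bound connect/recv instead of blocking indefinitely
    try:
        sock.connect(BROKER_SOCKET)   # raises FileNotFoundError (ENOENT) if the socket is gone
        sock.sendall(b"get-stats\n")  # placeholder request; the real broker protocol differs
        return sock.recv(4096).decode("utf-8", errors="replace")
    except (OSError, socket.timeout):
        # Broker not running or too slow; let the caller log a warning and retry later.
        return None
    finally:
        sock.close()
```

With a bounded timeout the periodic caller returns promptly instead of keeping a vdsm worker blocked long enough to trip the 'WARN Worker blocked' traceback.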
Answering here, for documentation purposes, a question received offline.

Vdsm keeps two thread pools[1] to handle possibly blocking operations. One is for JSON-RPC workers, another for periodic monitoring operations. These thread pools are smart: they can handle blocked workers and replace them, up to a configured high limit, so there are always workers available to perform tasks submitted to the pool. Tasks submitted to the pools always have a timeout. When the timeout expires and the task is still running, the executor flags the slow worker and immediately creates a new idle one, unless the worker limit is reached. When the slow worker eventually completes the operation (if it does!), it checks the "slow" flag. If set, the worker self-destructs; otherwise, it puts itself back in the idle pool.

+++
[1] these are the ones I'm aware of :)

(In reply to Ryan Barry from comment #18)
> That part actually doesn't look bad to me, Nikolai -- no traceback which
> will confuse automation.
>
> Dominik - what's the timeout for blocking workers? It seems to me that
> simply setting an appropriate timeout in ovirt_hosted_engine_ha (which
> doesn't have one right now, or any exception handling at all that I can see)
> would stop blocking.

Does comment 19 answer your question?

Almost -- what's the timeout?

(In reply to Ryan Barry from comment #21)
> Almost -- what's the timeout?

It depends. For the JSON-RPC tasks, it is tunable using the configuration value [rpc] worker_timeout; the default is 60 seconds. For the virt periodic operations, the timeout is half of the periodic interval. These intervals are again tunable using the configuration file; the values are:

irs.vol_size_sample_interval
vars.vm_sample_jobs_interval
vars.vm_watermark_interval
sampling.external_vm_lookup_interval
vars.vm_kill_paused_time
vars.vm_sample_interval

This one -- host sampling -- has a timeout equal to the period. Furthermore, no replacement attempts are made should the operation block for any reason:

vars.host_sample_stats_interval

Thanks Francesco. Milan, let's try to just beat the timeout. Setting a socket timeout of 20 seconds should be more than enough for HE, and will also hide the ugly exceptions.

Blocked calls cause the tracebacks even with oVirt gerrit 96618. I tried to add a timeout to the get_stats() call but it didn't help either. Maybe I did something wrong or the timeout doesn't apply to Unix domain connections. I'm not sure it's worth further work. BTW, I think the related Vdsm worker timeout is 15 seconds in HostMonitor, as seen in comment 0.

Consistently reproduced during addition of a second host as an ha-host, just as described in comment #17.

2019-02-11 19:46:02,357+0200 ERROR (periodic/0) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:191)
2019-02-11 19:46:02,500+0200 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:38226 (protocoldetector:61)
2019-02-11 19:46:02,521+0200 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:38228 (protocoldetector:61)
2019-02-11 19:46:02,523+0200 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:38226 (protocoldetector:125)
2019-02-11 19:46:02,523+0200 INFO (Reactor thread) [Broker.StompAdapter] Processing CONNECT request (stompserver:95)
2019-02-11 19:46:02,529+0200 INFO (JsonRpc (StompReactor)) [Broker.StompAdapter] Subscribe command received (stompserver:124)
2019-02-11 19:46:02,542+0200 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:38228 (protocoldetector:125)
2019-02-11 19:46:02,542+0200 INFO (Reactor thread) [Broker.StompAdapter] Processing CONNECT request (stompserver:95)
2019-02-11 19:46:02,544+0200 INFO (JsonRpc (StompReactor)) [Broker.StompAdapter] Subscribe command received (stompserver:124)
2019-02-11 19:46:02,570+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.ping2' in bridge with {} (__init__:324)
2019-02-11 19:46:02,570+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping2 failed (error 99) in 0.00 seconds (__init__:312)
2019-02-11 19:46:02,584+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.ping2' in bridge with {} (__init__:324)
2019-02-11 19:46:02,584+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.ping2 failed (error 99) in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,073+0200 INFO (hsm/init) [storage.HSM] FINISH HSM init succeeded in 1.21 seconds (hsm:410)
2019-02-11 19:46:03,576+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,597+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,604+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,608+0200 INFO (jsonrpc/5) [vdsm.api] START getStorageDomainInfo(sdUUID=u'da4b370f-2b0b-485a-853e-

Created attachment 1530062 [details]
sosreport from alma04
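To make the executor behaviour Francesco describes above more concrete, here is a rough, self-contained sketch of the "flag the slow worker and spawn a replacement" pattern. It is an illustration under stated assumptions, not vdsm's actual executor code; only the 15-second HostMonitor-style timeout is taken from the comments above, everything else (class and function names) is hypothetical.

```python
# Illustrative sketch of the flag-and-replace pattern described above.
import threading
import queue
import time

TASK_TIMEOUT = 15  # seconds; per the comments, the HostMonitor worker timeout

class Worker(threading.Thread):
    def __init__(self, tasks, on_blocked):
        super().__init__(daemon=True)
        self._tasks = tasks
        self._on_blocked = on_blocked
        self.discarded = False          # set when the executor gives up on us

    def run(self):
        while True:
            task = self._tasks.get()
            if task is None:            # shutdown sentinel
                return
            # Arm a watchdog: if the task overruns, notify the executor.
            timer = threading.Timer(TASK_TIMEOUT, self._on_blocked, args=(self,))
            timer.start()
            try:
                task()
            finally:
                timer.cancel()
            if self.discarded:          # a replacement already took our place
                return                  # self-destruct instead of rejoining the pool

class Executor:
    def __init__(self):
        self._tasks = queue.Queue()
        self._spawn()

    def _spawn(self):
        # A real pool would also enforce the configured high limit on workers.
        Worker(self._tasks, self._worker_blocked).start()

    def _worker_blocked(self, worker):
        print("WARN Worker blocked -- replacing it")  # vdsm logs a traceback here
        worker.discarded = True
        self._spawn()                   # keep an idle worker available for new tasks

    def submit(self, task):
        self._tasks.put(task)

if __name__ == "__main__":
    ex = Executor()
    ex.submit(lambda: time.sleep(20))   # a "blocked" periodic operation
    ex.submit(lambda: print("fast task still runs"))
    time.sleep(17)
```

The key point from the discussion is visible here: a blocked get_stats call does not stall monitoring, it only produces the 'Worker blocked' warning and traceback that the automation tests then trip over.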
Tested on:
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.3-1.el7ev.noarch
rhvm-appliance-4.3-20190201.0.el7.x86_64
Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

I reviewed this issue and traced through the code. Vdsm calls into the ovirt-hosted-engine-ha project's BrokerLink object, which attempts to send a get_stats request via the broker.socket file. When deploying the hosted engine on the host, if ovirt-ha-agent and ovirt-ha-broker are running, they are brought down and the broker.socket file is deleted. From that moment, until ovirt-ha-broker is restarted, the vdsm polling of the get_stats message fails with the error: "failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?". Only when ovirt-ha-broker is restarted is the broker.socket file recreated. The error is therefore technically correct: between the time the broker.socket file is deleted (when the ovirt-ha-broker service is terminated) and the time the service is restarted and recreates the file, the file does not exist, so error 2 (errno.ENOENT) is accurate. However, even after the ovirt-ha-broker service has started and created the file, the get_stats polling in vdsm may still fail with "failed to retrieve Hosted Engine HA score" until the ovirt-ha-agent service is also launched and can listen to the broker connection requests.

One solution may be not to stop and restart the services, so that they stay up and the broker.socket file continues to exist, but there may be a historical reason why the services are restarted. Another option may be to detect whether the services are running and only poll for get_stats when both services are running and the broker.socket file exists; this would require additional code, could affect performance, and may not be worth the effort. Another solution may be to not treat these diagnostics as errors, or to remove them altogether, but then there would be no proper diagnostics when a real problem occurs. I might lean towards making them warnings rather than errors, but more input from others would be welcome.

In my opinion the better and safer option is simply to downgrade the message from error to warning.

I still see this error on a newly deployed hosted engine environment, right on the first ha-host:

2019-06-23 16:58:32,289+0300 ERROR (periodic/2) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:191)
2019-06-23 16:58:33,946+0300 INFO (jsonrpc/2) [api.host] START getAllVmStats() from=::1,37484 (api:48)
2019-06-23 16:58:33,947+0300 INFO (jsonrpc/2) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,37484 (api:54)
2019-06-23 16:58:33,947+0300 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:312)
2019-06-23 16:58:47,316+0300 INFO (periodic/3) [vdsm.api] START repoStats(domains=()) from=internal, task_id=c89996fb-f4b9-4e22-aca7-431b0d9c9834 (api:48)
2019-06-23 16:58:47,317+0300 INFO (periodic/3) [vdsm.api] FINISH repoStats return={u'ae57631d-65cb-49c6-b143-f5561bd08da4': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.000359139', 'lastCheck': '2.3', 'valid': True}} from=internal, task_id=c89996fb-f4b9-4e22-aca7-431b0d9c9834 (api:54)

Components:
ovirt-hosted-engine-ha-2.3.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.10-1.el7ev.noarch
Linux 3.10.0-989.el7.bz1721503.x86_64 #1 SMP Fri Jun 21 15:10:35 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

Moving back to assigned. This was a fix within the vdsm module and was not included in the last tag, v4.40.0. It should be tested in the next vdsm release.

Then why did you move it to POST? We believe it's fixed, please retest.

Already present in 4.4, just retest.

I didn't see the reported error on any manually deployed RHEL-based deployment. Moving to verified. Please reopen in case you still get the error.

Tested on an NFS deployment with these components:
rhvm-appliance.x86_64 2:4.4-20200123.0.el8ev
rhv-4.4.0
sanlock-3.8.0-2.el8.x86_64
qemu-kvm-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64
vdsm-4.40.5-1.el8ev.x86_64
libvirt-client-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
ovirt-hosted-engine-setup-2.4.2-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Linux 4.18.0-183.el8.x86_64 #1 SMP Sun Feb 23 20:50:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

Engine is Software Version: 4.4.0-0.17.master.el7
Red Hat Enterprise Linux Server release 7.8 Beta (Maipo)
Linux 3.10.0-1123.el7.x86_64 #1 SMP Tue Jan 14 03:44:38 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
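The shipped fix follows the direction discussed in the comments above and summarised in the Doc Text: a missing or not-yet-ready broker during (re)deployment is treated as an expected transient condition and logged as a warning rather than an error. A minimal sketch of that idea follows; it is not the actual patch, and the function and logger names are illustrative only.

```python
# Illustrative only -- not the actual vdsm/ovirt-hosted-engine-ha patch.
import errno
import logging

log = logging.getLogger("root")

def report_ha_stats(get_stats):
    """Poll HA stats; a missing broker socket is expected during (re)deployment,
    so it is reported as a warning instead of an error."""
    try:
        return get_stats()
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Broker socket not there (yet): the services are being restarted.
            log.warning("failed to retrieve Hosted Engine HA score %r. "
                        "Is the Hosted Engine setup finished?", str(e))
        else:
            log.exception("failed to retrieve Hosted Engine HA score")
        return None
```

A periodic monitor would wrap its broker call with such a helper and simply skip updating the score when it returns None, so the journal stays free of tracebacks while genuinely unexpected broker failures still surface as errors.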