Description of problem:
Exception was discovered during automation testing.

ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory' Is the Hosted Engine setup finished?
WARN Worker blocked: <Worker name=periodic/0 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f4d64342150> at 0x7f4d64342190> timeout=15, duration=15 at 0x7f4d64178e50> task#=9 at 0x7f4d64376b50>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats
  ret['haStats'] = _getHaInfo()
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
  stats = instance.get_all_stats()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 93, in get_all_stats
  stats = broker.get_stats_from_storage()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
  result = self._proxy.get_stats()
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
  return self.__send(self.__name, args)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
  verbose=self.__verbose
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
  return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
  response = h.getresponse(buffering=True)
File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
  response.begin()
File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
  version, status, reason = self._read_status()
File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
  line = self.fp.readline(_MAXLINE + 1)
File: "/usr/lib64/python2.7/socket.py", line 476, in readline
  data = self._sock.recv(self._rbufsize)

Version-Release number of selected component (if applicable):
vdsm-4.20.25-1.el7ev.x86_64

Steps to Reproduce:
Unclear - it happens during some different tests

Actual results:
Exception in journal log of vdsmd service

Expected results:
No exceptions in journal log of vdsmd service
Is hosted engine installed on the vdsm host with this issue?
That is a difficult question. During the tests it can happen that the machine was removed and then added as a host without the hosted-engine deployment process (I mean the host was added as a plain host, not as a hosted-engine host). I ran all the tests again and checked the webadmin and all hosts. In the engine there are crowns on all hosts, but at first lynx23 and lynx24 showed exclamation marks for a few seconds with "unable to compute HA score". I'm attaching the output of hosted-engine --vm-status from all hosts.
Created attachment 1425253 [details] hosted-engine --vm-status
We really need a reproducer if we want to fix this, but looking at the report and comment #3 I am not really sure I would consider this a bug.
Users shouldn't see multiple printed exceptions in the journal log; they should see only errors in this case. I cannot give you exact reproduction steps for triggering this exception. I only have the logs from automation, from all machines, which I posted in comment #1.
I was asked by a customer to double-check his upgraded RHV env (4.1 -> 4.2). There are some other issues, so I had to conduct a similar upgrade in my test environment. After upgrading the hypervisor from rhvh-4.1-0.20180126.0 to rhvh-4.2.3.0-0.20180518.0 I have problems with ovirt-ha-agent, which leads to ovirt-ha-broker and finally to vdsm; many errors are reported:

ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:196)
*** Bug 1569601 has been marked as a duplicate of this bug. ***
The error: ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA score ... is logged when the ovirt-ha-broker service is not running, or is not ready to respond to requests. We could limit the frequency of these messages. The traceback beginning with 'WARN Worker blocked' is interesting. Is it somewhere in the provided logs? I didn't find it.
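Limiting the frequency could be a small throttle wrapped around the logger; a sketch (the class name and the 60-second window are arbitrary illustrations, not vdsm code):

```python
import time


class ThrottledLog:
    """Emit a repeated message at most once per `interval` seconds;
    swallowed repetitions are counted and reported on the next emit."""

    def __init__(self, log_fn, interval=60.0, clock=time.monotonic):
        self._log_fn = log_fn
        self._interval = interval
        self._clock = clock
        self._last = None
        self._suppressed = 0

    def log(self, msg):
        now = self._clock()
        if self._last is not None and now - self._last < self._interval:
            self._suppressed += 1  # within the window: swallow the repeat
            return False
        if self._suppressed:
            msg += " (%d similar messages suppressed)" % self._suppressed
            self._suppressed = 0
        self._last = now
        self._log_fn(msg)
        return True


# A fake clock makes the behaviour deterministic for the example.
t = [0.0]
emitted = []
tl = ThrottledLog(emitted.append, interval=60.0, clock=lambda: t[0])
for _ in range(5):
    tl.log("failed to retrieve Hosted Engine HA score")
    t[0] += 10.0                 # five attempts within one minute
print(len(emitted))              # -> 1: only the first got through
```

Injecting the clock keeps the throttle testable without sleeping.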
For example, in vdsm.log of the lynx23 server in the archive, at line 9924.
Petr, why is this an automation blocker? It just pollutes the log, no functionality is broken.
It blocks the test that searches for tracebacks in the journal log of the vdsmd service (as there shouldn't be any tracebacks). Due to this bug we cannot find other exceptions which could be more important.
Francesco, you mentioned that GSS asked for this traceback to stay visible, IIRC. Can we figure out how to make it work for us, QA, and GSS? One option might be showing a WARNING summary and a DEBUG traceback.
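For example, the split could be a small helper around the standard logging module (a sketch; the function name is made up):

```python
import logging


def log_soft_failure(log, summary, exc):
    # One-line summary at WARNING for operators and for automation that
    # scans the journal; the full traceback is preserved, but at DEBUG.
    log.warning("%s: %s", summary, exc)
    log.debug("%s", summary, exc_info=exc)


logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("vdsm.example")
try:
    open("/nonexistent/broker.socket")
except OSError as exc:
    log_soft_failure(log, "failed to retrieve Hosted Engine HA score", exc)
# At the default WARNING level only the summary line appears; enabling
# DEBUG on this logger reveals the traceback as well.
```

Automation scanning for tracebacks stays quiet in normal runs, while GSS can still get the full traceback by raising the logger to DEBUG.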
The traceback was implemented in Bug 1501012.
(In reply to Martin Sivák from comment #13)
> Francesco, you mentioned that GSS asked for this traceback to stay visible
> iirc. Can we figure out how to make it work for us, QA and GSS?
>
> One option might be showing WARNING summary and DEBUG traceback.

We can work on that. I myself have no preference; we "just" need consensus on a solution.
I see this error during addition of an additional ha-host to the engine:

2018-12-05 13:05:14,974+0200 ERROR (periodic/0) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:196)
2018-12-05 13:05:15,428+0200 INFO (hsm/init) [storage.HSM] FINISH HSM init succeeded in 1.37 seconds (hsm:410)
2018-12-05 13:05:15,718+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,725+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,729+0200 INFO (jsonrpc/4) [vdsm.api] START getStorageDomainInfo(sdUUID=u'35c9172d-c2ca-407f-b54a-67a4fb24b757', options=None) from=::1,60576, task_id=4af8dbf5-a1f3-4522-a803-2ca16b51dc88 (api:46)
2018-12-05 13:05:15,834+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,894+0200 INFO (itmap/0) [IOProcessClient] (/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_nsednev__he__1) Starting client (__init__:308)
2018-12-05 13:05:15,904+0200 INFO (ioprocess/57753) [IOProcess] (/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_nsednev__he__1) Starting ioprocess (__init__:434)
2018-12-05 13:05:15,909+0200 INFO (jsonrpc/4) [IOProcessClient] (35c9172d-c2ca-407f-b54a-67a4fb24b757) Starting client (__init__:308)
2018-12-05 13:05:15,920+0200 INFO (ioprocess/57760) [IOProcess] (35c9172d-c2ca-407f-b54a-67a4fb24b757) Starting ioprocess (__init__:434)
2018-12-05 13:05:15,926+0200 INFO (jsonrpc/4) [storage.StorageDomain] Removing remnants of deleted images [] (fileSD:734)
2018-12-05 13:05:15,926+0200 INFO (jsonrpc/4) [storage.StorageDomain] sdUUID=35c9172d-c2ca-407f-b54a-67a4fb24b757 (fileSD:534)
2018-12-05 13:05:15,934+0200 INFO (jsonrpc/4) [vdsm.api] FINISH getStorageDomainInfo return={'info': {'uuid': u'35c9172d-c2ca-407f-b54a-67a4fb24b757', 'version': '4', 'role': 'Master', 'remotePath': 'yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/nsednev_he_1', 'type': 'NFS', 'class': 'Data', 'pool': ['890ecd6b-6da7-4b1c-97d5-760239680b36'], 'name': 'hosted_storage'}} from=::1,60576, task_id=4af8dbf5-a1f3-4522-a803-2ca16b51dc88 (api:52)
2018-12-05 13:05:15,935+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call StorageDomain.getInfo succeeded in 0.20 seconds (__init__:573)
2018-12-05 13:05:15,943+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-05 13:05:15,947+0200 INFO (jsonrpc/7) [vdsm.api] START getStorageDomainInfo(sdUUID=u'35c9172d-c2ca-407f-b54a-

Steps for reproduction:
1. Deploy HE over NFS on one ha-host.
2. Add an additional regular host (not an ha-host).
3. Set the additional regular host into maintenance.
4. Reinstall the additional non-ha-host as a hosted-engine host.
5. Check vdsm.log on the host reinstalled as an ha-host.
That part actually doesn't look bad to me, Nikolai -- no traceback which will confuse automation. Dominik -- what's the timeout for blocking workers? It seems to me that simply setting an appropriate timeout in ovirt_hosted_engine_ha (which doesn't have one right now, or any exception handling at all that I can see) would stop the blocking. Milan -- I know you've never touched this, but it's Python ;) Likely candidates seem to be the lack of any connection timeouts in ovirt_hosted_engine_ha/lib/unixrpc.py, and the absence of exception handling in ovirt_hosted_engine_ha/lib/brokerlink.py#L138. No need for an actual fix to this, since Martin indicated that the messages are harmless; just trying to avoid blocking vdsm workers by throwing exceptions around.
Answering here, for documentation purposes, a question received offline. Vdsm keeps two thread pools[1] to handle possibly blocking operations: one for JSON-RPC workers, another for periodic monitoring operations. These thread pools are smart: they can handle blocked workers and replace them, up to a configured limit, so there are always workers available to perform tasks submitted to the pool. Tasks submitted to the pools always have a timeout. When the timeout expires and the task is still running, the executor flags the slow worker and immediately creates a new idle one, unless the worker limit is reached. When the slow worker eventually completes the operation (if it does!), it checks the "slow" flag. If set, the worker self-destructs; otherwise, it puts itself back in the idle pool. +++ [1] these are the ones I'm aware of :)
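The flag-and-replace mechanism just described can be sketched in miniature (class and method names here are illustrative, not vdsm's actual executor API):

```python
import threading
import time


class Pool:
    """Toy model of a vdsm-style executor pool."""

    def __init__(self, limit=4):
        self.limit = limit
        self.idle = []        # workers that finished within their deadline
        self.spawned = 0      # total workers ever created

    def submit(self, task, timeout):
        self.spawned += 1
        Worker(self, task, timeout).start()

    def replace(self, worker):
        # Deadline passed: flag the blocked worker and account for its
        # replacement (a real pool would start a fresh idle worker here),
        # up to the configured limit.
        worker.discarded = True
        if self.spawned < self.limit:
            self.spawned += 1


class Worker(threading.Thread):
    def __init__(self, pool, task, timeout):
        super().__init__(daemon=True)
        self.pool, self.task, self.timeout = pool, task, timeout
        self.discarded = False

    def run(self):
        # The timer is the task's deadline; if it fires, the pool replaces us.
        timer = threading.Timer(self.timeout, self.pool.replace, args=(self,))
        timer.start()
        try:
            self.task()
        finally:
            timer.cancel()
            if not self.discarded:
                self.pool.idle.append(self)   # fast worker rejoins the pool
            # a flagged (slow) worker simply exits: "self-destruct"


pool = Pool()
pool.submit(lambda: time.sleep(0.3), timeout=0.05)  # blocks past its deadline
pool.submit(lambda: None, timeout=0.05)             # completes in time
time.sleep(0.6)
print(len(pool.idle), pool.spawned)
```

Only the fast worker rejoins the idle list; the slow one is flagged, replaced while still blocked, and discards itself on completion.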
(In reply to Ryan Barry from comment #18)
> That part actually doesn't look bad to me, Nikolai -- no traceback which
> will confuse automation.
>
> Dominik - what's the timeout for blocking workers? It seems to me that
> simply setting an appropriate timeout in ovirt_hosted_engine_ha (which
> doesn't have one right now, or any exception handling at all that I can see)
> would stop blocking.

Does comment 19 answer your question?
Almost -- what's the timeout?
(In reply to Ryan Barry from comment #21)
> Almost -- what's the timeout?

It depends.

For the JSON-RPC tasks, it is tunable using the configuration value [rpc] worker_timeout; the default is 60 seconds.

For the virt periodic operations, the timeout is half of the periodic interval. These intervals are again tunable using the configuration file; the relevant values are:
irs.vol_size_sample_interval
vars.vm_sample_jobs_interval
vars.vm_watermark_interval
sampling.external_vm_lookup_interval
vars.vm_kill_paused_time
vars.vm_sample_interval

This one -- host sampling -- has a timeout equal to its period, and no replacement attempts are made should the operation block for any reason:
vars.host_sample_stats_interval
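For illustration, the JSON-RPC timeout named above would be tuned in vdsm's configuration file like this (a sketch; /etc/vdsm/vdsm.conf is the usual location, and 60 is the default just described):

```ini
# /etc/vdsm/vdsm.conf
[rpc]
# Seconds a JSON-RPC task may run before its worker is flagged as
# blocked and replaced (default: 60).
worker_timeout = 60
```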
Thanks Francesco. Milan, let's try to just beat the timeout. Setting a socket timeout of 20 seconds should be more than enough for HE, and also hide ugly exceptions
Blocked calls cause the tracebacks even with oVirt gerrit 96618. I tried to add a timeout to the get_stats() call, but it didn't help either; maybe I did something wrong, or the timeout doesn't apply to Unix domain connections. I'm not sure it's worth further work. BTW, I think the related Vdsm worker timeout is 15 seconds in HostMonitor, as seen in comment 0.
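For what it's worth, a plain socket-level timeout does apply to AF_UNIX sockets, and a missing socket file surfaces immediately as ENOENT rather than as a blocked call. A minimal sketch (the helper name and path are hypothetical, not the actual unixrpc.py code):

```python
import errno
import socket


def try_connect_broker(socket_path, timeout=20.0):
    """Connect to the HA broker's Unix socket with a bounded timeout.

    Returns the connected socket, or None when the socket file does not
    exist -- the condition behind the '[Errno 2]' log message.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)  # bounds connect() and later recv() calls
    try:
        sock.connect(socket_path)
    except OSError as e:
        sock.close()
        if e.errno == errno.ENOENT:
            return None  # broker.socket not created yet
        raise
    return sock


print(try_connect_broker("/nonexistent/broker.socket"))  # -> None
```

The timeout set before connect() also governs every subsequent recv(), so a silent peer raises socket.timeout instead of blocking the worker indefinitely.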
Consistently reproduced during addition of the second host as an ha-host, just as described in comment #17.

2019-02-11 19:46:02,357+0200 ERROR (periodic/0) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:191)
2019-02-11 19:46:02,500+0200 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:38226 (protocoldetector:61)
2019-02-11 19:46:02,521+0200 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:38228 (protocoldetector:61)
2019-02-11 19:46:02,523+0200 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:38226 (protocoldetector:125)
2019-02-11 19:46:02,523+0200 INFO (Reactor thread) [Broker.StompAdapter] Processing CONNECT request (stompserver:95)
2019-02-11 19:46:02,529+0200 INFO (JsonRpc (StompReactor)) [Broker.StompAdapter] Subscribe command received (stompserver:124)
2019-02-11 19:46:02,542+0200 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:38228 (protocoldetector:125)
2019-02-11 19:46:02,542+0200 INFO (Reactor thread) [Broker.StompAdapter] Processing CONNECT request (stompserver:95)
2019-02-11 19:46:02,544+0200 INFO (JsonRpc (StompReactor)) [Broker.StompAdapter] Subscribe command received (stompserver:124)
2019-02-11 19:46:02,570+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.ping2' in bridge with {} (__init__:324)
2019-02-11 19:46:02,570+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping2 failed (error 99) in 0.00 seconds (__init__:312)
2019-02-11 19:46:02,584+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.ping2' in bridge with {} (__init__:324)
2019-02-11 19:46:02,584+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.ping2 failed (error 99) in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,073+0200 INFO (hsm/init) [storage.HSM] FINISH HSM init succeeded in 1.21 seconds (hsm:410)
2019-02-11 19:46:03,576+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,597+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,604+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2019-02-11 19:46:03,608+0200 INFO (jsonrpc/5) [vdsm.api] START getStorageDomainInfo(sdUUID=u'da4b370f-2b0b-485a-853e-
Created attachment 1530062 [details] sosreport from alma04
Tested on:
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.3-1.el7ev.noarch
rhvm-appliance-4.3-20190201.0.el7.x86_64
Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)
I reviewed this issue and traced through the code. Vdsm calls into the ovirt-hosted-engine-ha project's BrokerLink object, which attempts to send a get_stats request via the broker.socket file.

When deploying the hosted engine on the host, if ovirt-ha-agent and ovirt-ha-broker are running, they are brought down and the broker.socket file is deleted. From that moment, until ovirt-ha-broker is restarted, vdsm's polling of the get_stats message will fail with the error: "failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?". Only when ovirt-ha-broker is restarted is the broker.socket file recreated. The error is therefore correct: from the time broker.socket is deleted (when the ovirt-ha-broker service is terminated) until the service is restarted and recreates the file, the file does not exist, so error 2 (errno.ENOENT) is accurate. However, even after ovirt-ha-broker has started and created the file, the get_stats polling in vdsm may still fail with "failed to retrieve Hosted Engine HA score" until the ovirt-ha-agent service is also running and can listen for broker connection requests.

One solution may be not to stop and restart the services, so that the broker.socket file continues to exist, but there may be a historical reason why the services are restarted. Another option may be to detect whether the services are running and only poll get_stats when both services are running and the broker.socket file exists; this would require additional code, could affect performance, and may not be worth the effort. Yet another solution may be to stop treating these diagnostics as errors, or to remove them altogether, but then there would be no proper diagnostics when a real problem occurs. I lean towards making them warnings rather than errors, but more input from others would be welcome.
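The second option above ("only poll when the broker.socket file exists") could be as small as a guard in front of the RPC. A sketch, where the function name, the socket path argument, and the returned dict shape are made up for illustration (a real check would also confirm the services are up):

```python
import logging
import os


def get_ha_stats(fetch, socket_path, log=logging.getLogger("ha")):
    """Poll broker stats only when the socket file exists.

    When it is absent (e.g. during redeployment, while the HA services
    are stopped), log at WARNING instead of ERROR and report the host
    as not configured rather than raising ENOENT.
    """
    if not os.path.exists(socket_path):
        log.warning("HA broker socket %s is missing; is the Hosted "
                    "Engine setup finished?", socket_path)
        return {"configured": False}
    return {"configured": True, "stats": fetch()}


print(get_ha_stats(lambda: {"score": 3400}, "/nonexistent/broker.socket"))
# -> {'configured': False}
```

The existence check is racy (the file can vanish between the check and the connect), so the RPC path would still need ENOENT handling; the guard only removes the noisy common case.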
In my opinion, the better and safer option is simply to downgrade the message from an error to a warning.
I still see this error on a newly deployed hosted-engine environment, right on the first ha-host:

2019-06-23 16:58:32,289+0300 ERROR (periodic/2) [root] failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? (api:191)
2019-06-23 16:58:33,946+0300 INFO (jsonrpc/2) [api.host] START getAllVmStats() from=::1,37484 (api:48)
2019-06-23 16:58:33,947+0300 INFO (jsonrpc/2) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,37484 (api:54)
2019-06-23 16:58:33,947+0300 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:312)
2019-06-23 16:58:47,316+0300 INFO (periodic/3) [vdsm.api] START repoStats(domains=()) from=internal, task_id=c89996fb-f4b9-4e22-aca7-431b0d9c9834 (api:48)
2019-06-23 16:58:47,317+0300 INFO (periodic/3) [vdsm.api] FINISH repoStats return={u'ae57631d-65cb-49c6-b143-f5561bd08da4': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.000359139', 'lastCheck': '2.3', 'valid': True}} from=internal, task_id=c89996fb-f4b9-4e22-aca7-431b0d9c9834 (api:54)

Components:
ovirt-hosted-engine-ha-2.3.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.10-1.el7ev.noarch
Linux 3.10.0-989.el7.bz1721503.x86_64 #1 SMP Fri Jun 21 15:10:35 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

Moving back to assigned.
This was a fix within the vdsm module, and it was not included in the last tag, v4.40.0. It should be tested in the next vdsm release.
Then why did you move it to POST?
we believe it's fixed, please retest
already present in 4.4, just retest already
Didn't see the reported error on any manually deployed RHEL-based deployment. Moving to verified. Please reopen in case you are still getting the error.

Tested on NFS deployment with these components:
rhvm-appliance.x86_64 2:4.4-20200123.0.el8ev
rhv-4.4.0
sanlock-3.8.0-2.el8.x86_64
qemu-kvm-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64
vdsm-4.40.5-1.el8ev.x86_64
libvirt-client-6.0.0-7.module+el8.2.0+5869+c23fe68b.x86_64
ovirt-hosted-engine-setup-2.4.2-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Linux 4.18.0-183.el8.x86_64 #1 SMP Sun Feb 23 20:50:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

Engine is Software Version: 4.4.0-0.17.master.el7
Red Hat Enterprise Linux Server release 7.8 Beta (Maipo)
Linux 3.10.0-1123.el7.x86_64 #1 SMP Tue Jan 14 03:44:38 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.