Created attachment 1641281 [details]
hypervisor-caracal05_ovirt-engine-logs

Description of problem:
[4.4.0-6] Failed to add secondary hosts (HA) with the error "Unable to stop service vdsmd".

Version-Release number of selected component (if applicable):
ovirt-host-4.4.0-0.1.master.20190917090202.git914ae95.el8ev.x86_64
ovirt-host-deploy-common-1.9.0-0.0.master.20191106103900.git98d0c82.el8ev.noarch
ovirt-host-dependencies-4.4.0-0.1.master.20190917090202.git914ae95.el8ev.x86_64
python3-ovirt-host-deploy-1.9.0-0.0.master.20191106103900.git98d0c82.el8ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.34-1.el8ev.noarch

How reproducible:
75%

Steps to Reproduce:
1. Deploy hosted engine (in my case over glusterfs)
2. Add a secondary host (with hosted engine)

Actual results:
When I tried to add secondary hosts (HA), host installation failed with the following error:

Host host_mixed_2 installation failed. Failed to execute Ansible host-deploy role. Please check logs for more details: /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log.

Looking at ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log, I can see that it failed in:

2019-12-01 08:55:40,910 p=ovirt u=4856 | RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2019-12-01 08:55:51,090 p=ovirt u=4856 | fatal: [caracal05.lab.eng.tlv2.redhat.com]: FAILED! => {
    "changed": false
}

MSG:

Unable to stop service vdsmd: Job for vdsmd.service canceled.

Expected results:
The secondary hosts are added successfully.

Additional info:
- When I tried deploying the hosted engine over NFS, I got the same result.
- In one run one host succeeded and one failed; in another run both hosts failed to install.
- After a while, when I set the host to maintenance and reinstalled it with "Hosted Engine", the installation succeeded, but the host has no HA capabilities.
Created attachment 1641282 [details] engine-hosted-engine-09_ovirt-engine-logs
Moving to oVirt Engine based on "Failed to execute Ansible host-deploy role."
Ondro, could it be relevant to ansible-runner? The role hasn't been changed since January 2019 ...
No, this version doesn't use ansible-runner, but I can't say what the root cause is without the vdsm log.
/var/log/messages from the host attached to the bug is empty, is there any error in the file on the host directly? Or anything relevant in journal logs?
On the host, the messages log is empty as well. I collected journal logs; please look in the attached journal_logs.tar.gz. The environment is locked for investigation, so you can look into caracal05.lab.eng.tlv2.redhat.com directly. Please ping me or @jmacku for more details.
Created attachment 1641579 [details] journal logs - correct files
Marcin, could you please take a look why vdsm cannot be stopped?
According to Marcin's investigation there are several different issues:

1. supervdsmd stop issue - there seems to be some kind of race for svdsm.socket: sometimes, while supervdsmd is shutting down, the multiprocessing library tries to refer to the already-deleted socket. This raises an exception, systemd restarts the service, the stop is unsuccessful, and that is why host deploy fails. This is reproducible only on hosts where issue 2 hasn't happened.

2. vdsmd stop issue - we can see attempts to stop vdsmd in the ansible host-deploy logs, but there is no trace of vdsmd actually being stopped, neither in the vdsm logs nor in journalctl. This is reproducible only on hosts where issue 1 hasn't happened.

3. ovirt-ha-broker issue - it is failing all the time due to the non-existent config entry 'metadata_image_UUID'. This issue has been reproduced on hosts where either issue 1 or issue 2 appeared.
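To illustrate issue 1, here is a minimal sketch of the shutdown race, with the fix being tolerant cleanup. This is hypothetical illustration code, not the actual supervdsmd or multiprocessing source: two cleanup paths both try to remove the socket file, and an unguarded second removal raises, which systemd then treats as a failed stop.

```python
# Hypothetical sketch of the svdsm.socket shutdown race (illustrative names,
# not the real supervdsmd code): two cleanup paths race to remove the socket.
import os
import tempfile
from contextlib import suppress

sock_path = os.path.join(tempfile.mkdtemp(), "svdsm.socket")
open(sock_path, "w").close()   # stand-in for the listening socket file

os.unlink(sock_path)           # first cleanup path removes the socket

# An unguarded second cleanup raises FileNotFoundError and aborts the stop:
raced = False
try:
    os.unlink(sock_path)
except FileNotFoundError:
    raced = True

# Tolerant cleanup: ignore an already-removed socket so shutdown can finish.
with suppress(FileNotFoundError):
    os.unlink(sock_path)

print("race hit, but cleanup stayed clean:", raced)
```

The point of the sketch is only that a second `unlink` of the same path must be treated as "already done", not as a fatal error during service stop.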
I can't find in any of the attached tar files logs for hosted-engine setup. Can you please attach these? Thanks.
Error: "Unable to stop service vdsmd: Job for vdsmd.service canceled." most probably means that the long-running vdsmd stop job was replaced by some other vdsmd start job. It could be caused by many different things, like a dependency/Wants/Requires relation which replaces the stop job. I don't know the systemd configuration of vdsm/supervdsm/libvirt very well, so I can't really say what is wrong there or what could trigger a vdsmd start while a vdsmd stop is running.
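To catch the job that replaces the stop, one could run something like the following on the host while host-deploy is stopping vdsmd. These are standard systemd commands (`systemctl list-jobs`, `systemctl list-dependencies --reverse`), guarded so the snippet degrades gracefully on machines where they are unavailable; this is a diagnostic sketch, not taken from the bug's logs.

```shell
# Snapshot the pending systemd job queue: a vdsmd start job appearing here
# while a vdsmd stop job is running is what cancels the stop.
jobs_output="$(systemctl list-jobs 2>/dev/null || echo 'systemctl unavailable')"
echo "pending systemd jobs:"
echo "$jobs_output"

# Reverse dependencies show which units can pull vdsmd back up during a stop
# (e.g. a unit with Requires=vdsmd.service being started or auto-restarted).
systemctl list-dependencies --reverse vdsmd.service 2>/dev/null || true
```

Running this in a loop during the failing host-deploy would show which unit's start job conflicts with the stop.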
(In reply to Yedidyah Bar David from comment #11)
> I can't find in any of the attached tar files logs for hosted-engine setup.
> Can you please attach these? Thanks.

Maybe the host-deploy logs could help. These logs are located in:
engine-hosted-engine-09_ovirt-engine-logs > tmp > ovirt-logs-engine > ovirt-engine-logs > host-deploy
> 2. vdsmd stop issue
> Error: "Unable to stop service vdsmd: Job for vdsmd.service canceled." most probably means that long running vdsm stop job was replaced by some other vdsm start job

Given that we have a problem with 'ovirt-ha-agent' and 'ovirt-ha-broker', and that 'ovirt-ha-agent' depends on 'vdsmd' [1], it is very probable that once we fix these issues, the vdsmd stop issue will be gone.

[1] https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/initscripts/ovirt-ha-agent.service.in
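For reference, the kind of dependency that can translate an ovirt-ha-agent (re)start into a vdsmd start job looks roughly like the fragment below. This is an illustrative systemd unit fragment, not a copy of the linked ovirt-ha-agent.service.in; the authoritative contents are in that file.

```ini
[Unit]
# Illustrative only: a hard Requires= means that starting ovirt-ha-agent
# (including an automatic restart after a crash) enqueues a start job for
# vdsmd, which conflicts with and cancels a pending vdsmd stop job.
Requires=vdsmd.service
After=vdsmd.service

[Service]
# An auto-restarting unit keeps re-triggering that start job.
Restart=on-failure
```

With this shape of dependency, the crash loop in ovirt-ha-agent/ovirt-ha-broker described above is enough to keep canceling host-deploy's vdsmd stop.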
(In reply to Yedidyah Bar David from comment #11)
> I can't find in any of the attached tar files logs for hosted-engine setup.
> Can you please attach these? Thanks.

If that's any help, journalctl is flooded with these failures alternating:

ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 80, in __init__
    raise_on_none=True),
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 87, in get
    "'{0} can't be '{1}'".format(key, val)
ValueError: 'metadata_image_UUID can't be 'None'

and:

ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': None, 'network_test': None, 'tcp_t_address': None, 'tcp_t_port': None}]
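The first traceback comes from a config lookup that refuses a missing value. A minimal sketch of that pattern, with names modeled on ovirt_hosted_engine_ha/env/config.py but not copied from it, reproduces the exact error string seen in journalctl:

```python
# Hedged reconstruction (illustrative, not the real config.py) of a config
# lookup with raise_on_none: an unset 'metadata_image_UUID' entry turns into
# the ValueError that floods journalctl.
class Config:
    def __init__(self, entries):
        self._entries = entries

    def get(self, key, raise_on_none=False):
        val = self._entries.get(key)
        if raise_on_none and val is None:
            # Same message shape as the traceback above.
            raise ValueError("'{0} can't be '{1}'".format(key, val))
        return val

# On the broken hosts the entry exists in no config file, so it resolves
# to None and the broker crashes on startup, over and over.
cfg = Config({"metadata_image_UUID": None})
try:
    cfg.get("metadata_image_UUID", raise_on_none=True)
    err = None
except ValueError as e:
    err = str(e)

print(err)
```

This is why the broker never comes up, which in turn is what the agent's `FileNotFoundError` on the broker's Unix socket reflects.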
I've cloned the bug to BZ1781892 to track the issue in ovirt-ha-broker. As the fix for VDSM is merged, moving to MODIFIED.
Should be fixed in 2.4.2 version of ovirt-hosted-engine-setup
I don't think this is good to verify before we get RHEL 8 appliance.
Verified on vdsm-4.40.9-1.el8ev.x86_64
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.