Bug 1781892 - [4.4.0-6] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd"
Summary: [4.4.0-6] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd"
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Yedidyah Bar David
QA Contact: Kobi Hakimi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-10 20:32 UTC by Martin Perina
Modified: 2019-12-12 06:35 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1778638
Environment:
Last Closed: 2019-12-12 06:35:17 UTC
oVirt Team: Integration
Embargoed:
mperina: ovirt-4.4?



Description Martin Perina 2019-12-10 20:32:21 UTC
+++ This bug was initially created as a clone of Bug #1778638 +++

Description of problem:
[4.4.0-6] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd"

Version-Release number of selected component (if applicable):
ovirt-host-4.4.0-0.1.master.20190917090202.git914ae95.el8ev.x86_64
ovirt-host-deploy-common-1.9.0-0.0.master.20191106103900.git98d0c82.el8ev.noarch
ovirt-host-dependencies-4.4.0-0.1.master.20190917090202.git914ae95.el8ev.x86_64
python3-ovirt-host-deploy-1.9.0-0.0.master.20191106103900.git98d0c82.el8ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.34-1.el8ev.noarch

How reproducible:
75%

Steps to Reproduce:
1. Deploy hosted engine (in my case over glusterfs)
2. Add a secondary host (with hosted engine)

Actual results:
When I tried to add secondary hosts (HA), the host installation failed with the following error:
Host host_mixed_2 installation failed. Failed to execute Ansible host-deploy role. Please check logs for more details: /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log.

When I look at ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log I can see that it failed at:
2019-12-01 08:55:40,910 p=ovirt u=4856 | RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2019-12-01 08:55:51,090 p=ovirt u=4856 | fatal: [caracal05.lab.eng.tlv2.redhat.com]: FAILED! => {
    "changed": false
}
MSG:
Unable to stop service vdsmd: Job for vdsmd.service canceled.


Expected results:
The secondary hosts are added successfully.

Additional info:
- When I tried deploying hosted engine over NFS I got the same result.
- In one run one host succeeded and one failed; in another run both hosts failed to install.
- After a while, when I set the host to maintenance and reinstalled it with "Hosted Engine", the installation succeeded, but the host had no HA capabilities.


--- Additional comment from Sandro Bonazzola on 2019-12-02 16:05:23 CET ---

Moving to oVirt Engine based on "Failed to execute Ansible host-deploy role."

--- Additional comment from Martin Perina on 2019-12-02 16:26:33 CET ---

Ondro, could it be related to ansible-runner? The role hasn't been changed since January 2019 ...

--- Additional comment from Ondra Machacek on 2019-12-02 17:10:50 CET ---

No, this version doesn't use ansible-runner, but I can't say what the root cause is without the vdsm log.

--- Additional comment from Martin Perina on 2019-12-02 19:27:30 CET ---

/var/log/messages from the host attached to the bug is empty; is there any error in that file directly on the host? Or anything relevant in the journal logs?

--- Additional comment from Kobi Hakimi on 2019-12-03 08:59:43 CET ---

On the host, the messages log is empty as well.
I collected journal logs; please look at the attached journal_logs.tar.gz.
The env is locked for investigation, so you can look into it directly on caracal05.lab.eng.tlv2.redhat.com.
Please ping me or @jmacku for more details.

--- Additional comment from Martin Perina on 2019-12-03 13:16:10 CET ---

Marcin, could you please take a look at why vdsm cannot be stopped?

--- Additional comment from Martin Perina on 2019-12-06 17:46:51 CET ---

According to Marcin's investigation there are several different issues:

1. supervdsmd stop issue
    - there seems to be some kind of race for svdsm.socket: sometimes, when supervdsmd is shutting down, the multiprocessing library tries to refer to the already-deleted socket
    - this results in an exception being raised and systemd restarting the service, so the stop is unsuccessful and that's why host deploy fails
    - this is reproducible only on hosts where issue 2 hasn't happened

2. vdsmd stop issue
    - we can see attempts to stop vdsmd within the ansible host-deploy logs, but there is no trace of vdsmd actually being stopped, neither in the vdsm logs nor in journalctl
    - this is reproducible only on hosts where issue 1 hasn't happened

3. ovirt-ha-broker issue
    - it's failing all the time due to the non-existent config entry 'metadata_image_UUID' (a simplified sketch of this failure follows below)
    - this issue has been reproduced on both hosts, whether issue 1 or issue 2 appeared there
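
For illustration only, a minimal sketch of the failure behind issue 3, modelled on the ValueError traceback quoted later in this report (the function and dict below are simplified stand-ins, not the actual ovirt_hosted_engine_ha code):

    # Simplified stand-in for the hosted-engine config lookup described in issue 3:
    # a key read with raise_on_none=True fails hard if deploy never wrote the value.
    def get(config, key, raise_on_none=False):
        val = config.get(key)  # None when the key is missing from the config
        if raise_on_none and val is None:
            raise ValueError("'{0} can't be '{1}'".format(key, val))
        return val

    config = {}  # 'metadata_image_UUID' was never written on the failing hosts
    get(config, 'metadata_image_UUID', raise_on_none=True)
    # -> ValueError: 'metadata_image_UUID can't be 'None', so the broker keeps crash-looping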

--- Additional comment from Yedidyah Bar David on 2019-12-09 11:20:10 CET ---

I can't find in any of the attached tar files logs for hosted-engine setup. Can you please attach these? Thanks.

--- Additional comment from Ondra Machacek on 2019-12-09 11:34:37 CET ---

Error: "Unable to stop service vdsmd: Job for vdsmd.service canceled." most probably means that long running vdsm stop job was replaced by some other vdsm start job. It could be caused by many different things like dependency/want/requires which replace the stop job. I don't know very well the systemd configuration of vdsm/supervdsm/libevirt, so I can't really say what is wrong there and what could cause calling vdsm start, while running vdsm stop.

--- Additional comment from Kobi Hakimi on 2019-12-09 11:59:45 CET ---

(In reply to Yedidyah Bar David from comment #11)
> I can't find in any of the attached tar files logs for hosted-engine setup.
> Can you please attach these? Thanks.

Maybe the host deploy logs could help.
These logs are located in:
engine-hosted-engine-09_ovirt-engine-logs > tmp > ovirt-logs-engine > ovirt-engine-logs > host-deploy

--- Additional comment from Marcin Sobczyk on 2019-12-09 12:39:03 CET ---

> 2. vdsmd stop issue

> Error: "Unable to stop service vdsmd: Job for vdsmd.service canceled." most probably means that long running vdsm stop job was replaced by some other vdsm start job

Given that we have a problem with 'ovirt-ha-agent' and 'ovirt-ha-broker', and that 'ovirt-ha-agent' depends on 'vdsmd' [1], it's very probable that once we fix these issues, the vdsmd stop issue will be gone.

[1] https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/initscripts/ovirt-ha-agent.service.in

--- Additional comment from Marcin Sobczyk on 2019-12-09 12:43:45 CET ---

(In reply to Yedidyah Bar David from comment #11)
> I can't find in any of the attached tar files logs for hosted-engine setup.
> Can you please attach these? Thanks.

If it's of any help, journalctl is flooded with these failures, alternating:

ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 80, in __init__
    raise_on_none=True),
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 87, in get
    "'{0} can't be '{1}'".format(key, val)
ValueError: 'metadata_image_UUID can't be 'None'

and:

ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': None, 'network_test': None, 'tcp_t_address': None, 'tcp_t_port': None}]
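
The two tracebacks are consistent with each other: because of the missing 'metadata_image_UUID' the broker dies before it can create its listening unix socket, so the agent's XML-RPC connection to that socket fails with ENOENT. A minimal sketch of that failure mode (the socket path below is hypothetical, not the real broker socket path):

    # Connecting to a unix socket path that was never created raises
    # FileNotFoundError, which is exactly what the agent traceback ends with.
    import socket

    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect('/run/hypothetical/broker.socket')  # hypothetical path; ENOENT -> FileNotFoundError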

Comment 1 Yedidyah Bar David 2019-12-11 07:16:15 UTC
I'd like to clarify something: I already downloaded all the attachments of bug 1778638, untarred them, and grepped all of them for 'metadata_image_UUID'. This string appears in two files:

./hypervisor-caracal05_ovirt-engine-logs.tar.gz.d/var/tmp/ovirt-logs-hypervisor/hosted-engine-logs/broker.log
./journal_logs.tar.gz.d/journal_logs/journal_errors.log

This does not help. If metadata_image_UUID is indeed missing somewhere, it was supposed to be written at setup/deploy time. Without setup/deploy logs, I can't do much. I suggest reproducing, and attaching more logs from all relevant machines (first host, second host, engine).

If the attached logs are from Jenkins, I suggest updating the job to always collect these.

I am currently trying to reproduce on my own machines, and will update if I find something.

Kobi, if this is reproducible and happens again, please attach the relevant logs and/or let me access the machines. Thanks.

Comment 2 Michal Skrivanek 2019-12-11 09:15:35 UTC
Note that this build has the bug in updating the OVF, so if there's anything you'd require to be present in the HE OVF updated by the engine, it's probably not there...

Comment 3 Kobi Hakimi 2019-12-11 11:55:37 UTC
I tried to reproduce it with the same (old) build 4.4.0-6 and the deploy succeeded :)
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.4-ge-flow-infra/125/console

So from my side we can close this bug, and in case we see it again we can reopen it.

Comment 4 Yedidyah Bar David 2019-12-12 06:35:17 UTC
Very well. Closing for now. Please reopen if needed, and attach all relevant logs :-)

