Bug 1778638 - [4.4.0-6] failed to add secondary hosts(HA) with error "Unable to stop service supervdsmd"
Summary: [4.4.0-6] failed to add secondary hosts(HA) with error "Unable to stop service supervdsmd"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: SuperVDSM
Version: 4.40.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.0
Target Release: 4.40.7
Assignee: Marcin Sobczyk
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-02 08:53 UTC by Kobi Hakimi
Modified: 2020-05-20 20:01 UTC
CC: 12 users

Fixed In Version: vdsm-4.40.7
Clone Of:
Cloned To: 1781892
Environment:
Last Closed: 2020-05-20 20:01:29 UTC
oVirt Team: Infra
Embargoed:
dagur: needinfo+
pm-rhel: ovirt-4.4+
sbonazzo: blocker?


Attachments (Terms of Use)
hypervisor-caracal05_ovirt-engine-logs (219.00 KB, application/gzip), 2019-12-02 08:53 UTC, Kobi Hakimi
engine-hosted-engine-09_ovirt-engine-logs (1.80 MB, application/gzip), 2019-12-02 08:54 UTC, Kobi Hakimi
journal logs - correct files (3.11 MB, application/gzip), 2019-12-03 08:08 UTC, Kobi Hakimi


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 105399 0 master MERGED supervdsm: Cleaner shutdown procedure 2020-11-06 00:16:19 UTC

Description Kobi Hakimi 2019-12-02 08:53:21 UTC
Created attachment 1641281 [details]
hypervisor-caracal05_ovirt-engine-logs

Description of problem:
[4.4.0-6] failed to add secondary hosts(HA) with error "Unable to stop service vdsmd"

Version-Release number of selected component (if applicable):
ovirt-host-4.4.0-0.1.master.20190917090202.git914ae95.el8ev.x86_64
ovirt-host-deploy-common-1.9.0-0.0.master.20191106103900.git98d0c82.el8ev.noarch
ovirt-host-dependencies-4.4.0-0.1.master.20190917090202.git914ae95.el8ev.x86_64
python3-ovirt-host-deploy-1.9.0-0.0.master.20191106103900.git98d0c82.el8ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.34-1.el8ev.noarch

How reproducible:
75%

Steps to Reproduce:
1. Deploy hosted engine(in my case over glusterfs)
2. Add secondary host(with hosted engine)

Actual results:
When I tried to add secondary hosts (HA), the host installation failed with the following error:
Host host_mixed_2 installation failed. Failed to execute Ansible host-deploy role. Please check logs for more details: /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log.

When I look at ovirt-host-deploy-ansible-20191201085346-caracal05.lab.eng.tlv2.redhat.com-6567b489.log, I can see that it failed at:
2019-12-01 08:55:40,910 p=ovirt u=4856 | RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2019-12-01 08:55:51,090 p=ovirt u=4856 | fatal: [caracal05.lab.eng.tlv2.redhat.com]: FAILED! => {
    "changed": false
}
MSG:
Unable to stop service vdsmd: Job for vdsmd.service canceled.


Expected results:
The secondary hosts are added successfully.

Additional info:
- When I tried deploying the hosted engine over NFS, I got the same result.
- In one run, one host succeeded and one failed; in another run, both hosts failed to install.
- After a while, when I set the host to maintenance and reinstalled it with "Hosted Engine", the installation succeeded, but the host had no HA capabilities.

Comment 1 Kobi Hakimi 2019-12-02 08:54:16 UTC
Created attachment 1641282 [details]
engine-hosted-engine-09_ovirt-engine-logs

Comment 2 Sandro Bonazzola 2019-12-02 15:05:23 UTC
Moving to oVirt Engine based on "Failed to execute Ansible host-deploy role."

Comment 3 Martin Perina 2019-12-02 15:26:33 UTC
Ondro, could it be related to ansible-runner? The role hasn't been changed since January 2019...

Comment 4 Ondra Machacek 2019-12-02 16:10:50 UTC
No, this version doesn't use ansible-runner, but I can't say what the root cause is without the vdsm log.

Comment 5 Martin Perina 2019-12-02 18:27:30 UTC
The /var/log/messages from the host attached to the bug is empty. Is there any error in the file directly on the host? Or anything relevant in the journal logs?

Comment 6 Kobi Hakimi 2019-12-03 07:59:43 UTC
On the host, the messages log is empty as well.
I collected journal logs; please look in the attached journal_logs.tar.gz.
The env is locked for investigation, so you can look into it: caracal05.lab.eng.tlv2.redhat.com
Please ping me or @jmacku for more details.

Comment 8 Kobi Hakimi 2019-12-03 08:08:50 UTC
Created attachment 1641579 [details]
journal logs - correct files

Comment 9 Martin Perina 2019-12-03 12:16:10 UTC
Marcin, could you please take a look why vdsm cannot be stopped?

Comment 10 Martin Perina 2019-12-06 16:46:51 UTC
According to Marcin's investigation, there are several different issues:

1. supervdsmd stop issue
    - there seems to be some kind of race on svdsm.socket: sometimes, when supervdsmd is shutting down, the multiprocessing library tries to refer to the already-deleted socket (see the sketch after this list)
    - this results in an Exception being raised; systemd then restarts the service, so the stop is unsuccessful, and that's why host deploy fails
    - this is reproducible only on hosts where issue 2 hasn't happened

2. vdsmd stop issue
    - we can see attempts to stop vdsmd in the ansible host-deploy logs, but there is no trace in either the vdsm logs or journalctl that vdsmd was actually being stopped
    - this is reproducible only on hosts where issue 1 hasn't happened

3. ovirt-ha-broker issue
    - it's failing all the time due to the non-existent config entry 'metadata_image_UUID'
    - this issue has been reproduced on both hosts, regardless of whether issue 1 or issue 2 appeared
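
To make issue 1 concrete, here is a minimal, self-contained sketch of the shutdown ordering problem, assuming supervdsmd serves its Unix socket through Python's multiprocessing manager machinery. The path and class names below are illustrative stand-ins, not the actual vdsm code; the merged fix is the "supervdsm: Cleaner shutdown procedure" gerrit change linked above.

# Toy illustration only (not vdsm source): a manager serving a Unix
# socket, and two shutdown orderings. Unlinking the socket file while
# the server is still accepting connections models the race described
# in issue 1; stopping the server first avoids it.
import os
import multiprocessing.managers as managers

SOCKET_PATH = '/tmp/svdsm-demo.socket'  # stand-in for vdsm's svdsm.socket

class DemoManager(managers.BaseManager):
    pass

def racy_shutdown(manager):
    # Problematic order: clients (and the server itself) can still
    # reference the socket path after it has been deleted.
    os.unlink(SOCKET_PATH)
    manager.shutdown()

def clean_shutdown(manager):
    # Cleaner order: stop serving first, then remove any leftover
    # socket file, so nobody observes a half-torn-down endpoint.
    manager.shutdown()
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)

if __name__ == '__main__':
    manager = DemoManager(address=SOCKET_PATH, authkey=b'demo')
    manager.start()
    clean_shutdown(manager)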

Comment 11 Yedidyah Bar David 2019-12-09 10:20:10 UTC
I can't find in any of the attached tar files logs for hosted-engine setup. Can you please attach these? Thanks.

Comment 12 Ondra Machacek 2019-12-09 10:34:37 UTC
Error: "Unable to stop service vdsmd: Job for vdsmd.service canceled." most probably means that long running vdsm stop job was replaced by some other vdsm start job. It could be caused by many different things like dependency/want/requires which replace the stop job. I don't know very well the systemd configuration of vdsm/supervdsm/libevirt, so I can't really say what is wrong there and what could cause calling vdsm start, while running vdsm stop.

Comment 13 Kobi Hakimi 2019-12-09 10:59:45 UTC
(In reply to Yedidyah Bar David from comment #11)
> I can't find in any of the attached tar files logs for hosted-engine setup.
> Can you please attach these? Thanks.

Maybe the host deploy logs could help.
These logs are located in:
engine-hosted-engine-09_ovirt-engine-logs > tmp > ovirt-logs-engine > ovirt-engine-logs > host-deploy

Comment 14 Marcin Sobczyk 2019-12-09 11:39:03 UTC
> 2. vdsmd stop issue

> Error: "Unable to stop service vdsmd: Job for vdsmd.service canceled." most probably means that long running vdsm stop job was replaced by some other vdsm start job

Given that we have a problem with 'ovirt-ha-agent' and 'ovirt-ha-broker', and that 'ovirt-ha-agent' depends on 'vdsmd' [1], it's very probable that once we fix those issues, the vdsmd stop issue will be gone.

[1] https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/initscripts/ovirt-ha-agent.service.in

Comment 15 Marcin Sobczyk 2019-12-09 11:43:45 UTC
(In reply to Yedidyah Bar David from comment #11)
> I can't find in any of the attached tar files logs for hosted-engine setup.
> Can you please attach these? Thanks.

If it's any help, journalctl is flooded with these two failures, alternating:

ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 80, in __init__
    raise_on_none=True),
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 87, in get
    "'{0} can't be '{1}'".format(key, val)
ValueError: 'metadata_image_UUID can't be 'None'

and:

ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': None, 'network_test': None, 'tcp_t_address': None, 'tcp_t_port': None}]
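
The first of the two tracebacks above pins issue 3 down precisely: config.py line 87 raises because the value looked up for 'metadata_image_UUID' is None and the caller passed raise_on_none=True. A minimal sketch of that behavior, reconstructed only from the traceback (names are illustrative, not the actual ovirt-hosted-engine-ha source):

# Hedged reconstruction from the traceback; config_get is hypothetical.
def config_get(cfg, key, raise_on_none=False):
    val = cfg.get(key)
    if val is None and raise_on_none:
        # Same format string as config.py line 87 in the traceback,
        # which explains the odd quoting in the logged ValueError.
        raise ValueError("'{0} can't be '{1}'".format(key, val))
    return val

try:
    config_get({'metadata_image_UUID': None}, 'metadata_image_UUID',
               raise_on_none=True)
except ValueError as exc:
    print(exc)  # 'metadata_image_UUID can't be 'None'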

Comment 16 Martin Perina 2019-12-10 20:36:42 UTC
I've cloned the bug to BZ1781892 to track the issue in ovirt-ha-broker. As the fix for VDSM is merged, moving to MODIFIED.

Comment 20 Lukas Svaty 2020-01-30 13:16:22 UTC
Should be fixed in version 2.4.2 of ovirt-hosted-engine-setup.

Comment 24 Petr Matyáš 2020-03-10 15:45:41 UTC
I don't think this is good to verify before we get a RHEL 8 appliance.

Comment 25 Petr Matyáš 2020-03-27 11:35:51 UTC
Verified on vdsm-4.40.9-1.el8ev.x86_64

Comment 26 Sandro Bonazzola 2020-05-20 20:01:29 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

