Bug 1795238 - ovirt_hosted_engine_ha exception: failed to start monitor via ovirt-ha-broker
Summary: ovirt_hosted_engine_ha exception: failed to start monitor via ovirt-ha-broker
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 2.4.1
Hardware: All
OS: All
high
high
Target Milestone: ovirt-4.4.0
: ---
Assignee: Evgeny Slutsky
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1665138
TreeView+ depends on / blocked
 
Reported: 2020-01-27 14:02 UTC by Lukas Svaty
Modified: 2020-05-20 20:03 UTC (History)
6 users (show)

Fixed In Version: ovirt-engine 4.4.0 alpha 551beb5b
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 20:03:17 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.4?
mtessun: planning_ack+
sbonazzo: devel_ack+
sbonazzo: testing_ack?


Attachments (Terms of Use)
lynx14 first ha host running the engine (18.59 MB, application/x-xz)
2020-01-27 14:49 UTC, Nikolai Sednev
no flags Details
lynx 16 second ha host which did not succeed to get added (12.33 MB, application/x-xz)
2020-01-27 14:50 UTC, Nikolai Sednev
no flags Details
lynx 17 third ha host which did not succeed to get added (13.03 MB, application/x-xz)
2020-01-27 14:51 UTC, Nikolai Sednev
no flags Details
engine logs from hosted-engine-02.lab.eng.tlv2.redhat.com (10.20 MB, application/x-xz)
2020-01-27 14:52 UTC, Nikolai Sednev
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 106750 0 master MERGED ansible: change stop&start to restart for vnc certificates deployment 2021-02-09 17:36:20 UTC

Description Lukas Svaty 2020-01-27 14:02:31 UTC
Description of problem:
Unable to add secondary host to HE engine due to error on stopping vdsm.

Version-Release number of selected component (if applicable):
4.4.0-17

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE
2. Add secondary host

Actual results:
Jan 24 05:49:05 host.redhat.com ovirt-ha-agent[29496]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
                                                                        File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
                                                                          response = self._proxy.start_monitor(type, options)
                                                                        File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
                                                                          return self.__send(self.__name, args)
                                                                        File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
                                                                          verbose=self.__verbose
                                                                        File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
                                                                          return self.single_request(host, handler, request_body, verbose)
                                                                        File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
                                                                          http_conn = self.send_request(host, handler, request_body, verbose)
                                                                        File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
                                                                          self.send_content(connection, request_body)
                                                                        File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
                                                                          connection.endheaders(request_body)
                                                                        File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
                                                                          self._send_output(message_body, encode_chunked=encode_chunked)
                                                                        File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
                                                                          self.send(msg)
                                                                        File "/usr/lib64/python3.6/http/client.py", line 974, in send
                                                                          self.connect()
                                                                        File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
                                                                          self.sock.connect(base64.b16decode(self.host))
                                                                      FileNotFoundError: [Errno 2] No such file or directory
                                                                     
                                                                      During handling of the above exception, another exception occurred:
                                                                     
                                                                      Traceback (most recent call last):
                                                                        File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
                                                                          return action(he)
                                                                        File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
                                                                          return he.start_monitoring()
                                                                        File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monito>
                                                                          self._initialize_broker()
                                                                        File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_>
                                                                          m.get('options', {}))
                                                                        File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
                                                                          ).format(t=type, o=options, e=e)
                                                                      ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Err>



Expected results:
Host is added and vdsm successfully stopped on restart.

Comment 1 Nikolai Sednev 2020-01-27 14:29:38 UTC
I think the issue is here less /var/log/ovirt-hosted-engine-ha/broker.log
:
TypeError: '>' not supported between instances of 'int' and 'NoneType'
Thread-42829::INFO::2020-01-27 09:21:22,007::mgmt_bridge::73::mgmt_bridge.MgmtBridge::(action) Bridge ovirtmgmt not fo
und
Thread-42830::INFO::2020-01-27 09:21:22,048::mem_free::51::mem_free.MemFree::(action) memFree: 24956
Thread-42832::ERROR::2020-01-27 09:21:26,933::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.Subm
onitorBase::(_worker) Error executing submonitor engine-health, args {'address': '0', 'use_ssl': 'true', 'vm_uuid': 'e
809715c-d99b-4811-9c77-3de1d7a3e95d'}
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 117, in act
ion
    self._update_stats(stats, vdsm_ts, local_ts)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 122, in _update_stats
    if not self._newer_timestamp(vdsm_ts, local_ts):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 160, in _newer_timestamp
    return local_ts > self._stats_local_timestamp
TypeError: '>' not supported between instances of 'int' and 'NoneType'
Listener::INFO::2020-01-27 09:21:27,952::status_broker::242::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker::(_acquire_whiteboard_lock) Acquired lock on host id 2
Thread-42829::INFO::2020-01-27 09:21:27,978::mgmt_bridge::73::mgmt_bridge.MgmtBridge::(action) Bridge ovirtmgmt not found
Thread-42830::INFO::2020-01-27 09:21:28,019::mem_free::51::mem_free.MemFree::(action) memFree: 24956
Listener::ERROR::2020-01-27 09:21:28,330::notifications::42::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 111] Connection refused
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 29, in send_email
    timeout=float(cfg["smtp-timeout"]))
  File "/usr/lib64/python3.6/smtplib.py", line 251, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python3.6/smtplib.py", line 336, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python3.6/smtplib.py", line 307, in _get_socket
    self.source_address)
  File "/usr/lib64/python3.6/socket.py", line 724, in create_connection
    raise err
  File "/usr/lib64/python3.6/socket.py", line 713, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused


Looks like broker was unable to manage vdsm later on, although I did not seen anything interesting in vdsm log.
Engine also reports multiple errors:
2020-01-24 05:49:59,016-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThr
eadFactory-engine-Thread-27) [57dbd056] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_3 installation failed. Task
 stop vdsmd failed to execute: .
2020-01-24 05:49:59,023-05 INFO  [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFac
tory-engine-Thread-27) [57dbd056] Lock freed to object 'EngineLock:{exclusiveLocks='[bb98b322-c904-4868-bbd2-2f61fddce
66d=VDS]', sharedLocks=''}'
2020-01-24 05:50:15,601-05 ERROR [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (EE-ManagedThreadFactory
-engine-Thread-22) [328cfe72] Exception: Failed to read the runner-service response. Unexpected character ('<' (code 6
0)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.http.conn.EofSensorInputStream@561438c2; line: 1, column: 2]
2020-01-24 05:50:15,955-05 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFac
tory-engine-Thread-22) [328cfe72] Host installation failed for host '7bccafb5-a839-4b8c-ac29-a6f6e968d9ab', 'host_mixe
d_2': Task stop vdsmd failed to execute: 
2020-01-24 05:50:15,958-05 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-eng
ine-Thread-22) [328cfe72] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hos
tId='7bccafb5-a839-4b8c-ac29-a6f6e968d9ab', status='InstallFailed', nonOperationalReason='NONE', stopSpmFailureLogged=
'false', maintenanceReason='null'}), log id: 3b14cd64
2020-01-24 05:50:15,963-05 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-eng
ine-Thread-22) [328cfe72] FINISH, SetVdsStatusVDSCommand, return: , log id: 3b14cd64
2020-01-24 05:50:15,968-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThr
eadFactory-engine-Thread-22) [328cfe72] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_2 installation failed. Task
 stop vdsmd failed to execute: .

Comment 2 Nikolai Sednev 2020-01-27 14:48:10 UTC
Adding logs from all 3 hosts and the engine.

Comment 3 Nikolai Sednev 2020-01-27 14:49:22 UTC
Created attachment 1655697 [details]
lynx14 first ha host running the engine

Comment 4 Nikolai Sednev 2020-01-27 14:50:42 UTC
Created attachment 1655699 [details]
lynx 16 second ha host which did not succeed to get added

lynx 16 second ha host which did not succeed to get added

Comment 5 Nikolai Sednev 2020-01-27 14:51:52 UTC
Created attachment 1655700 [details]
lynx 17 third ha host which did not succeed to get added

Comment 6 Nikolai Sednev 2020-01-27 14:52:47 UTC
Created attachment 1655701 [details]
engine logs from hosted-engine-02.lab.eng.tlv2.redhat.com

engine logs from hosted-engine-02.lab.eng.tlv2.redhat.com

Comment 7 Jiri Macku 2020-02-04 10:03:58 UTC
Hit this bug again in 4.4.0-18. 
From journalctl of host failed to be add:
Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Failed to start necessary monitors
Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
                                                                                File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
                                                                                  response = self._proxy.start_monitor(type, options)
                                                                                File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
                                                                                  return self.__send(self.__name, args)
                                                                                File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
                                                                                  verbose=self.__verbose
                                                                                File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
                                                                                  return self.single_request(host, handler, request_body, verbose)
                                                                                File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
                                                                                  http_conn = self.send_request(host, handler, request_body, verbose)
                                                                                File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
                                                                                  self.send_content(connection, request_body)
                                                                                File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
                                                                                  connection.endheaders(request_body)
                                                                                File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
                                                                                  self._send_output(message_body, encode_chunked=encode_chunked)
                                                                                File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
                                                                                  self.send(msg)
                                                                                File "/usr/lib64/python3.6/http/client.py", line 974, in send
                                                                                  self.connect()
                                                                                File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
                                                                                  self.sock.connect(base64.b16decode(self.host))
                                                                              FileNotFoundError: [Errno 2] No such file or directory
                                                                              
                                                                              During handling of the above exception, another exception occurred:
                                                                              
                                                                              Traceback (most recent call last):
                                                                                File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
                                                                                  return action(he)
                                                                                File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
                                                                                  return he.start_monitoring()
                                                                                File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
                                                                                  self._initialize_broker()
                                                                                File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
                                                                                  m.get('options', {}))
                                                                                File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
                                                                                  ).format(t=type, o=options, e=e)
                                                                              ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '10.37.143.254', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Trying to restart agent

Comment 8 Evgeny Slutsky 2020-02-04 12:24:44 UTC
in this log /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20200204034308-ci-ge-2-h2.rhev.lab.eng.brq.redhat.com-5cfbb596.log
 appears: 
2020-02-04 03:51:42 EST - TASK [Executing post tasks defined by user] ************************************
2020-02-04 03:51:42 EST - RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2020-02-04 03:51:42 EST - fatal: [ci-ge-2-h2.rhev.lab.eng.brq.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n"}
2020-02-04 03:51:42 EST - {
  "status" : "OK",
  "msg" : "",
  "data" : {
    "event" : "runner_on_failed",
    "uuid" : "f6f2e712-2caa-406e-8488-6c46cf1241ac",
    "stdout" : "fatal: [ci-ge-2-h2.rhev.lab.eng.brq.redhat.com]: FAILED! => {\"changed\": false, \"msg\": \"Unable to stop service vdsmd: Job for vdsmd.service canceled.\\n\"}",
    "counter" : 538,
    "pid" : 3739,
    "created" : "2020-02-04T08:51:40.612863",
    "end_line" : 616,
    "runner_ident" : "5edf5708-472a-11ea-bd78-001a4ab10000",
    "start_line" : 615,
    "event_data" : {


it could be related to:  https://bugzilla.redhat.com/show_bug.cgi?id=1786118

Comment 10 Sandro Bonazzola 2020-02-04 12:53:26 UTC
(In reply to Evgeny Slutsky from comment #9)
> IMHO there should be retry in the engine stopping vdsmd service attempt:
> https://github.com/oVirt/ovirt-engine/blob/
> 3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service-
> project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1

make sense. we should also try to ensure ha broker doesn't try to start vdsm service while we are still deploying the host and stopping vdsm.

Comment 11 Michal Skrivanek 2020-02-04 14:13:55 UTC
(In reply to Sandro Bonazzola from comment #10)
> (In reply to Evgeny Slutsky from comment #9)
> > IMHO there should be retry in the engine stopping vdsmd service attempt:
> > https://github.com/oVirt/ovirt-engine/blob/
> > 3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service-
> > project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1
> 
> make sense. we should also try to ensure ha broker doesn't try to start vdsm
> service while we are still deploying the host and stopping vdsm.

so why not just stop all of the other services too? That's cleaner than a naive retry. Plus remove the auto-start during rpm installation and moving that to host deploy

Comment 12 Nikolai Sednev 2020-03-31 12:58:49 UTC
Worked for me on fresh and clean environment, I successfully added secondary ha-host with no issues, deployment of HE 4.4 was performed on NFS.
I also undeployed and then deployed secondary ha-host multiple times with no issues.
I added secondary host as regular none ha-host without any errors.

Tested on host with these components:
rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Engine:
ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch
ovirt-engine-4.4.0-0.26.master.el8ev.noarch
openvswitch2.11-2.11.0-48.el8fdp.x86_64
Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

Comment 13 Sandro Bonazzola 2020-05-20 20:03:17 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.