Bug 1795238
| Summary: | ovirt_hosted_engine_ha exception: failed to start monitor via ovirt-ha-broker | ||
|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Lukas Svaty <lsvaty> |
| Component: | Broker | Assignee: | Evgeny Slutsky <eslutsky> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Nikolai Sednev <nsednev> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.4.1 | CC: | bugs, eslutsky, jmacku, michal.skrivanek, mtessun, nsednev |
| Target Milestone: | ovirt-4.4.0 | Keywords: | Triaged |
| Target Release: | --- | Flags: | sbonazzo:
ovirt-4.4?
mtessun: planning_ack+ sbonazzo: devel_ack+ sbonazzo: testing_ack? |
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | ovirt-engine 4.4.0 alpha 551beb5b | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-05-20 20:03:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1665138 | ||
| Attachments: | |||
I think the issue is here less /var/log/ovirt-hosted-engine-ha/broker.log
:
TypeError: '>' not supported between instances of 'int' and 'NoneType'
Thread-42829::INFO::2020-01-27 09:21:22,007::mgmt_bridge::73::mgmt_bridge.MgmtBridge::(action) Bridge ovirtmgmt not fo
und
Thread-42830::INFO::2020-01-27 09:21:22,048::mem_free::51::mem_free.MemFree::(action) memFree: 24956
Thread-42832::ERROR::2020-01-27 09:21:26,933::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.Subm
onitorBase::(_worker) Error executing submonitor engine-health, args {'address': '0', 'use_ssl': 'true', 'vm_uuid': 'e
809715c-d99b-4811-9c77-3de1d7a3e95d'}
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
self.action(self._options)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 117, in act
ion
self._update_stats(stats, vdsm_ts, local_ts)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 122, in _update_stats
if not self._newer_timestamp(vdsm_ts, local_ts):
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 160, in _newer_timestamp
return local_ts > self._stats_local_timestamp
TypeError: '>' not supported between instances of 'int' and 'NoneType'
Listener::INFO::2020-01-27 09:21:27,952::status_broker::242::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker::(_acquire_whiteboard_lock) Acquired lock on host id 2
Thread-42829::INFO::2020-01-27 09:21:27,978::mgmt_bridge::73::mgmt_bridge.MgmtBridge::(action) Bridge ovirtmgmt not found
Thread-42830::INFO::2020-01-27 09:21:28,019::mem_free::51::mem_free.MemFree::(action) memFree: 24956
Listener::ERROR::2020-01-27 09:21:28,330::notifications::42::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 111] Connection refused
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 29, in send_email
timeout=float(cfg["smtp-timeout"]))
File "/usr/lib64/python3.6/smtplib.py", line 251, in __init__
(code, msg) = self.connect(host, port)
File "/usr/lib64/python3.6/smtplib.py", line 336, in connect
self.sock = self._get_socket(host, port, self.timeout)
File "/usr/lib64/python3.6/smtplib.py", line 307, in _get_socket
self.source_address)
File "/usr/lib64/python3.6/socket.py", line 724, in create_connection
raise err
File "/usr/lib64/python3.6/socket.py", line 713, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
Looks like broker was unable to manage vdsm later on, although I did not seen anything interesting in vdsm log.
Engine also reports multiple errors:
2020-01-24 05:49:59,016-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThr
eadFactory-engine-Thread-27) [57dbd056] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_3 installation failed. Task
stop vdsmd failed to execute: .
2020-01-24 05:49:59,023-05 INFO [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFac
tory-engine-Thread-27) [57dbd056] Lock freed to object 'EngineLock:{exclusiveLocks='[bb98b322-c904-4868-bbd2-2f61fddce
66d=VDS]', sharedLocks=''}'
2020-01-24 05:50:15,601-05 ERROR [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (EE-ManagedThreadFactory
-engine-Thread-22) [328cfe72] Exception: Failed to read the runner-service response. Unexpected character ('<' (code 6
0)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: org.apache.http.conn.EofSensorInputStream@561438c2; line: 1, column: 2]
2020-01-24 05:50:15,955-05 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFac
tory-engine-Thread-22) [328cfe72] Host installation failed for host '7bccafb5-a839-4b8c-ac29-a6f6e968d9ab', 'host_mixe
d_2': Task stop vdsmd failed to execute:
2020-01-24 05:50:15,958-05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-eng
ine-Thread-22) [328cfe72] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hos
tId='7bccafb5-a839-4b8c-ac29-a6f6e968d9ab', status='InstallFailed', nonOperationalReason='NONE', stopSpmFailureLogged=
'false', maintenanceReason='null'}), log id: 3b14cd64
2020-01-24 05:50:15,963-05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-eng
ine-Thread-22) [328cfe72] FINISH, SetVdsStatusVDSCommand, return: , log id: 3b14cd64
2020-01-24 05:50:15,968-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThr
eadFactory-engine-Thread-22) [328cfe72] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_2 installation failed. Task
stop vdsmd failed to execute: .
Adding logs from all 3 hosts and the engine. Created attachment 1655697 [details]
lynx14 first ha host running the engine
Created attachment 1655699 [details]
lynx 16 second ha host which did not succeed to get added
lynx 16 second ha host which did not succeed to get added
Created attachment 1655700 [details]
lynx 17 third ha host which did not succeed to get added
Created attachment 1655701 [details]
engine logs from hosted-engine-02.lab.eng.tlv2.redhat.com
engine logs from hosted-engine-02.lab.eng.tlv2.redhat.com
Hit this bug again in 4.4.0-18.
From journalctl of host failed to be add:
Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Failed to start necessary monitors
Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
response = self._proxy.start_monitor(type, options)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
return self.__send(self.__name, args)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
verbose=self.__verbose
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
return self.single_request(host, handler, request_body, verbose)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
http_conn = self.send_request(host, handler, request_body, verbose)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
self.send_content(connection, request_body)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
connection.endheaders(request_body)
File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/lib64/python3.6/http/client.py", line 974, in send
self.connect()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
return action(he)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
return he.start_monitoring()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
self._initialize_broker()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
m.get('options', {}))
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '10.37.143.254', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Trying to restart agent
in this log /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20200204034308-ci-ge-2-h2.rhev.lab.eng.brq.redhat.com-5cfbb596.log
appears:
2020-02-04 03:51:42 EST - TASK [Executing post tasks defined by user] ************************************
2020-02-04 03:51:42 EST - RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] **************
2020-02-04 03:51:42 EST - fatal: [ci-ge-2-h2.rhev.lab.eng.brq.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n"}
2020-02-04 03:51:42 EST - {
"status" : "OK",
"msg" : "",
"data" : {
"event" : "runner_on_failed",
"uuid" : "f6f2e712-2caa-406e-8488-6c46cf1241ac",
"stdout" : "fatal: [ci-ge-2-h2.rhev.lab.eng.brq.redhat.com]: FAILED! => {\"changed\": false, \"msg\": \"Unable to stop service vdsmd: Job for vdsmd.service canceled.\\n\"}",
"counter" : 538,
"pid" : 3739,
"created" : "2020-02-04T08:51:40.612863",
"end_line" : 616,
"runner_ident" : "5edf5708-472a-11ea-bd78-001a4ab10000",
"start_line" : 615,
"event_data" : {
it could be related to: https://bugzilla.redhat.com/show_bug.cgi?id=1786118
IMHO there should be retry in the engine stopping vdsmd service attempt: https://github.com/oVirt/ovirt-engine/blob/3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service-project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1 (In reply to Evgeny Slutsky from comment #9) > IMHO there should be retry in the engine stopping vdsmd service attempt: > https://github.com/oVirt/ovirt-engine/blob/ > 3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service- > project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1 make sense. we should also try to ensure ha broker doesn't try to start vdsm service while we are still deploying the host and stopping vdsm. (In reply to Sandro Bonazzola from comment #10) > (In reply to Evgeny Slutsky from comment #9) > > IMHO there should be retry in the engine stopping vdsmd service attempt: > > https://github.com/oVirt/ovirt-engine/blob/ > > 3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service- > > project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1 > > make sense. we should also try to ensure ha broker doesn't try to start vdsm > service while we are still deploying the host and stopping vdsm. so why not just stop all of the other services too? That's cleaner than a naive retry. Plus remove the auto-start during rpm installation and moving that to host deploy Worked for me on fresh and clean environment, I successfully added secondary ha-host with no issues, deployment of HE 4.4 was performed on NFS. I also undeployed and then deployed secondary ha-host multiple times with no issues. I added secondary host as regular none ha-host without any errors. Tested on host with these components: rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch Red Hat Enterprise Linux release 8.2 Beta (Ootpa) Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux Engine: ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch ovirt-engine-4.4.0-0.26.master.el8ev.noarch openvswitch2.11-2.11.0-48.el8fdp.x86_64 Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux release 8.2 Beta (Ootpa) This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report. |
Description of problem: Unable to add secondary host to HE engine due to error on stopping vdsm. Version-Release number of selected component (if applicable): 4.4.0-17 How reproducible: 100% Steps to Reproduce: 1. Deploy HE 2. Add secondary host Actual results: Jan 24 05:49:05 host.redhat.com ovirt-ha-agent[29496]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor response = self._proxy.start_monitor(type, options) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__ return self.__send(self.__name, args) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request verbose=self.__verbose File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request return self.single_request(host, handler, request_body, verbose) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request http_conn = self.send_request(host, handler, request_body, verbose) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request self.send_content(connection, request_body) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content connection.endheaders(request_body) File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output self.send(msg) File "/usr/lib64/python3.6/http/client.py", line 974, in send self.connect() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect self.sock.connect(base64.b16decode(self.host)) FileNotFoundError: [Errno 2] No such file or directory During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent return action(he) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper return he.start_monitoring() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monito> self._initialize_broker() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_> m.get('options', {})) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor ).format(t=type, o=options, e=e) ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Err> Expected results: Host is added and vdsm successfully stopped on restart.