Description of problem: Unable to add secondary host to HE engine due to error on stopping vdsm. Version-Release number of selected component (if applicable): 4.4.0-17 How reproducible: 100% Steps to Reproduce: 1. Deploy HE 2. Add secondary host Actual results: Jan 24 05:49:05 host.redhat.com ovirt-ha-agent[29496]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor response = self._proxy.start_monitor(type, options) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__ return self.__send(self.__name, args) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request verbose=self.__verbose File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request return self.single_request(host, handler, request_body, verbose) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request http_conn = self.send_request(host, handler, request_body, verbose) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request self.send_content(connection, request_body) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content connection.endheaders(request_body) File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output self.send(msg) File "/usr/lib64/python3.6/http/client.py", line 974, in send self.connect() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect self.sock.connect(base64.b16decode(self.host)) FileNotFoundError: [Errno 2] No such file or directory During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent return action(he) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper return he.start_monitoring() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monito> self._initialize_broker() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_> m.get('options', {})) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor ).format(t=type, o=options, e=e) ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Err> Expected results: Host is added and vdsm successfully stopped on restart.
I think the issue is here less /var/log/ovirt-hosted-engine-ha/broker.log : TypeError: '>' not supported between instances of 'int' and 'NoneType' Thread-42829::INFO::2020-01-27 09:21:22,007::mgmt_bridge::73::mgmt_bridge.MgmtBridge::(action) Bridge ovirtmgmt not fo und Thread-42830::INFO::2020-01-27 09:21:22,048::mem_free::51::mem_free.MemFree::(action) memFree: 24956 Thread-42832::ERROR::2020-01-27 09:21:26,933::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.Subm onitorBase::(_worker) Error executing submonitor engine-health, args {'address': '0', 'use_ssl': 'true', 'vm_uuid': 'e 809715c-d99b-4811-9c77-3de1d7a3e95d'} Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker self.action(self._options) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 117, in act ion self._update_stats(stats, vdsm_ts, local_ts) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 122, in _update_stats if not self._newer_timestamp(vdsm_ts, local_ts): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors/engine_health.py", line 160, in _newer_timestamp return local_ts > self._stats_local_timestamp TypeError: '>' not supported between instances of 'int' and 'NoneType' Listener::INFO::2020-01-27 09:21:27,952::status_broker::242::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker::(_acquire_whiteboard_lock) Acquired lock on host id 2 Thread-42829::INFO::2020-01-27 09:21:27,978::mgmt_bridge::73::mgmt_bridge.MgmtBridge::(action) Bridge ovirtmgmt not found Thread-42830::INFO::2020-01-27 09:21:28,019::mem_free::51::mem_free.MemFree::(action) memFree: 24956 Listener::ERROR::2020-01-27 09:21:28,330::notifications::42::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 111] Connection refused Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 29, in send_email timeout=float(cfg["smtp-timeout"])) File "/usr/lib64/python3.6/smtplib.py", line 251, in __init__ (code, msg) = self.connect(host, port) File "/usr/lib64/python3.6/smtplib.py", line 336, in connect self.sock = self._get_socket(host, port, self.timeout) File "/usr/lib64/python3.6/smtplib.py", line 307, in _get_socket self.source_address) File "/usr/lib64/python3.6/socket.py", line 724, in create_connection raise err File "/usr/lib64/python3.6/socket.py", line 713, in create_connection sock.connect(sa) ConnectionRefusedError: [Errno 111] Connection refused Looks like broker was unable to manage vdsm later on, although I did not seen anything interesting in vdsm log. Engine also reports multiple errors: 2020-01-24 05:49:59,016-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThr eadFactory-engine-Thread-27) [57dbd056] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_3 installation failed. Task stop vdsmd failed to execute: . 2020-01-24 05:49:59,023-05 INFO [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFac tory-engine-Thread-27) [57dbd056] Lock freed to object 'EngineLock:{exclusiveLocks='[bb98b322-c904-4868-bbd2-2f61fddce 66d=VDS]', sharedLocks=''}' 2020-01-24 05:50:15,601-05 ERROR [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (EE-ManagedThreadFactory -engine-Thread-22) [328cfe72] Exception: Failed to read the runner-service response. Unexpected character ('<' (code 6 0)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: org.apache.http.conn.EofSensorInputStream@561438c2; line: 1, column: 2] 2020-01-24 05:50:15,955-05 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFac tory-engine-Thread-22) [328cfe72] Host installation failed for host '7bccafb5-a839-4b8c-ac29-a6f6e968d9ab', 'host_mixe d_2': Task stop vdsmd failed to execute: 2020-01-24 05:50:15,958-05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-eng ine-Thread-22) [328cfe72] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hos tId='7bccafb5-a839-4b8c-ac29-a6f6e968d9ab', status='InstallFailed', nonOperationalReason='NONE', stopSpmFailureLogged= 'false', maintenanceReason='null'}), log id: 3b14cd64 2020-01-24 05:50:15,963-05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-eng ine-Thread-22) [328cfe72] FINISH, SetVdsStatusVDSCommand, return: , log id: 3b14cd64 2020-01-24 05:50:15,968-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThr eadFactory-engine-Thread-22) [328cfe72] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_2 installation failed. Task stop vdsmd failed to execute: .
Adding logs from all 3 hosts and the engine.
Created attachment 1655697 [details] lynx14 first ha host running the engine
Created attachment 1655699 [details] lynx 16 second ha host which did not succeed to get added lynx 16 second ha host which did not succeed to get added
Created attachment 1655700 [details] lynx 17 third ha host which did not succeed to get added
Created attachment 1655701 [details] engine logs from hosted-engine-02.lab.eng.tlv2.redhat.com engine logs from hosted-engine-02.lab.eng.tlv2.redhat.com
Hit this bug again in 4.4.0-18. From journalctl of host failed to be add: Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Failed to start necessary monitors Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor response = self._proxy.start_monitor(type, options) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__ return self.__send(self.__name, args) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request verbose=self.__verbose File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request return self.single_request(host, handler, request_body, verbose) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request http_conn = self.send_request(host, handler, request_body, verbose) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request self.send_content(connection, request_body) File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content connection.endheaders(request_body) File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output self.send(msg) File "/usr/lib64/python3.6/http/client.py", line 974, in send self.connect() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect self.sock.connect(base64.b16decode(self.host)) FileNotFoundError: [Errno 2] No such file or directory During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent return action(he) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper return he.start_monitoring() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring self._initialize_broker() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker m.get('options', {})) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor ).format(t=type, o=options, e=e) ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '10.37.143.254', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}] Feb 04 03:51:30 ci-ge-2-h2.rhev.lab.eng.brq.redhat.com ovirt-ha-agent[18186]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Trying to restart agent
in this log /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20200204034308-ci-ge-2-h2.rhev.lab.eng.brq.redhat.com-5cfbb596.log appears: 2020-02-04 03:51:42 EST - TASK [Executing post tasks defined by user] ************************************ 2020-02-04 03:51:42 EST - RUNNING HANDLER [ovirt-host-deploy-vnc-certificates : stop vdsmd] ************** 2020-02-04 03:51:42 EST - fatal: [ci-ge-2-h2.rhev.lab.eng.brq.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to stop service vdsmd: Job for vdsmd.service canceled.\n"} 2020-02-04 03:51:42 EST - { "status" : "OK", "msg" : "", "data" : { "event" : "runner_on_failed", "uuid" : "f6f2e712-2caa-406e-8488-6c46cf1241ac", "stdout" : "fatal: [ci-ge-2-h2.rhev.lab.eng.brq.redhat.com]: FAILED! => {\"changed\": false, \"msg\": \"Unable to stop service vdsmd: Job for vdsmd.service canceled.\\n\"}", "counter" : 538, "pid" : 3739, "created" : "2020-02-04T08:51:40.612863", "end_line" : 616, "runner_ident" : "5edf5708-472a-11ea-bd78-001a4ab10000", "start_line" : 615, "event_data" : { it could be related to: https://bugzilla.redhat.com/show_bug.cgi?id=1786118
IMHO there should be retry in the engine stopping vdsmd service attempt: https://github.com/oVirt/ovirt-engine/blob/3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service-project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1
(In reply to Evgeny Slutsky from comment #9) > IMHO there should be retry in the engine stopping vdsmd service attempt: > https://github.com/oVirt/ovirt-engine/blob/ > 3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service- > project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1 make sense. we should also try to ensure ha broker doesn't try to start vdsm service while we are still deploying the host and stopping vdsm.
(In reply to Sandro Bonazzola from comment #10) > (In reply to Evgeny Slutsky from comment #9) > > IMHO there should be retry in the engine stopping vdsmd service attempt: > > https://github.com/oVirt/ovirt-engine/blob/ > > 3f5dffc1cda8c23cbb2b877c78bee1f8f7ee2e96/packaging/ansible-runner-service- > > project/project/roles/ovirt-host-deploy-vnc-certificates/handlers/main.yml#L1 > > make sense. we should also try to ensure ha broker doesn't try to start vdsm > service while we are still deploying the host and stopping vdsm. so why not just stop all of the other services too? That's cleaner than a naive retry. Plus remove the auto-start during rpm installation and moving that to host deploy
Worked for me on fresh and clean environment, I successfully added secondary ha-host with no issues, deployment of HE 4.4 was performed on NFS. I also undeployed and then deployed secondary ha-host multiple times with no issues. I added secondary host as regular none ha-host without any errors. Tested on host with these components: rhvm-appliance.x86_64 2:4.4-20200326.0.el8ev ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch Red Hat Enterprise Linux release 8.2 Beta (Ootpa) Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux Engine: ovirt-engine-setup-base-4.4.0-0.26.master.el8ev.noarch ovirt-engine-4.4.0-0.26.master.el8ev.noarch openvswitch2.11-2.11.0-48.el8fdp.x86_64 Linux 4.18.0-192.el8.x86_64 #1 SMP Tue Mar 24 14:06:40 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.