Description of problem:
The connection between the agent and the broker has a 30-second timeout. When it expires, the connection is dropped and the agent restarts.

MainThread::DEBUG::2016-07-21 09:47:34,506::brokerlink::273::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Sending request: notify time=1469108854.51 type=state_transition detail=StartState-ReinitializeFSM hostname='poseidon.netsec'
MainThread::DEBUG::2016-07-21 09:47:34,506::util::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) socket_readline with 30.0 seconds timeout
MainThread::DEBUG::2016-07-21 09:48:04,545::util::88::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) Connection timeout while reading from socket
MainThread::ERROR::2016-07-21 09:48:04,545::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::DEBUG::2016-07-21 09:48:04,546::brokerlink::86::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(disconnect) Closing connection to ha-broker
MainThread::ERROR::2016-07-21 09:48:04,547::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'poseidon.netsec'}: Connection timed out' - trying to restart agent

In some circumstances the broker has to send a notification email, and it does so synchronously. If connecting to the SMTP server takes more than 30 seconds, the agent restarts without a clear indication of the root cause (the log shows only a timeout). The current code is:

    server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])

If this happens 10 times in a row (not that uncommon when the SMTP server is misconfigured or not working), the agent disables itself.

Version-Release number of selected component (if applicable):

How reproducible:
Only when connecting to the SMTP server takes more than 30 seconds.

Steps to Reproduce:
1. Deploy hosted-engine.
2. Make the SMTP server unreachable in a way that takes a long time to detect.

Actual results:
The connection between the agent and the broker is dropped due to the timeout, which causes the agent to restart. If this happens more than 10 times in a row, the agent disables itself.

Expected results:
The agent should keep working even if the broker takes too long to send a notification email, or at least it should fail with a clear error about sending the notification email.

Additional info:
We have to make the SMTP connection asynchronous or, more simply, add a timeout value (lower than the agent-broker connection timeout) to:

    server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])
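
As an illustration of the simpler option, here is a minimal sketch of passing a timeout to smtplib.SMTP and reporting a clear error when the connection fails. It is not the project's actual patch. The "smtp-timeout" key is the one visible in the traceback later in this report; the helper names, the 5-second safety margin, the capping logic and the injectable smtp_factory parameter are assumptions made for this example.

```python
import logging
import smtplib
import socket

log = logging.getLogger("notifications_sketch")

# Agent-side read timeout seen in the agent log ("30.0 seconds timeout").
AGENT_BROKER_TIMEOUT = 30.0
# Hypothetical safety margin, not taken from the project.
SAFETY_MARGIN = 5.0


def effective_smtp_timeout(cfg):
    """Return an SMTP timeout that stays below the agent-broker timeout."""
    limit = AGENT_BROKER_TIMEOUT - SAFETY_MARGIN
    configured = float(cfg.get("smtp-timeout", limit))
    return min(configured, limit)


def connect_smtp(cfg, smtp_factory=smtplib.SMTP):
    """Open an SMTP connection, or log a clear error and return None.

    smtp_factory is injectable (an assumption of this sketch) so the
    failure path can be exercised without a real SMTP server.
    """
    timeout = effective_smtp_timeout(cfg)
    try:
        return smtp_factory(cfg["smtp-server"],
                            port=int(cfg["smtp-port"]),
                            timeout=timeout)
    except (socket.error, smtplib.SMTPException) as exc:
        # socket.timeout is a subclass of socket.error, so a slow or
        # unreachable server ends up here instead of stalling the broker.
        log.error("Cannot send notification email: connecting to SMTP "
                  "server %s:%s failed (timeout %.1f s): %s",
                  cfg["smtp-server"], cfg["smtp-port"], timeout, exc)
        return None
```

With this shape, a caller would skip the notification when connect_smtp() returns None, so the broker can answer the agent before the 30-second limit and the log names the SMTP server as the cause.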
Can you please fill in the doc-text?
ha-agent does not die, although I do see this error in broker.log:

Thread-3::ERROR::2016-08-30 16:57:32,259::notifications::39::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 26, in send_email
    timeout=float(cfg["smtp-timeout"]))
  File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
timeout: timed out

Works for me with these components.

Host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.21.x86_64
sanlock-3.2.4-3.el7_2.x86_64
rhevm-appliance-20160731.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.5-1.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
rhev-release-3.6.9-1-001.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
Linux version 3.10.0-327.36.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Wed Aug 17 03:02:37 EDT 2016
Linux 3.10.0-327.36.1.el7.x86_64 #1 SMP Wed Aug 17 03:02:37 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
ovirt-engine-dwh-setup-4.0.2-1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-setup-0.3.0-0.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.3-0.1.el7ev.noarch
ovirt-engine-restapi-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-cli-3.6.8.1-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-0.3.0-0.el7ev.noarch
ovirt-engine-tools-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.0.3-0.1.el7ev.noarch
python-ovirt-engine-sdk4-4.0.0-0.5.a5.el7ev.x86_64
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-engine-dashboard-1.0.3-1.el7ev.x86_64
ovirt-engine-userportal-4.0.3-0.1.el7ev.noarch
ovirt-engine-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.1-1.el7ev.noarch
ovirt-engine-lib-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-4.0.3-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.3-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.2-1.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.3-0.1.el7ev.noarch
ovirt-engine-backend-4.0.3-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-5.el7ev.noarch
rhevm-4.0.3-0.1.el7ev.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.2-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)