Bug 1359059

Summary: The agent got stuck if the broker takes more than 30 seconds to reach the smtp server
Product: [oVirt] ovirt-hosted-engine-ha Reporter: Simone Tiraboschi <stirabos>
Component: Broker Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE QA Contact: Nikolai Sednev <nsednev>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 2.0.1 CC: akrejcir, bugs, dfediuck, gveitmic, mavital, mgoldboi, obockows, rs, sbonazzo
Target Milestone: ovirt-4.0.3 Keywords: Triaged, ZStream
Target Release: 2.0.3 Flags: rule-engine: ovirt-4.0.z+
rule-engine: blocker+
mgoldboi: planning_ack+
msivak: devel_ack+
mavital: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sla
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: HA broker waits for a nonresponsive SMTP server without timeout. Consequence: HA agent waits for the broker. Fix: Add a timeout to the connection between the broker and SMTP server. Result: Broker and agent do not wait for SMTP response for a long time.
Story Points: ---
Clone Of:
: 1364286 (view as bug list) Environment:
Last Closed: 2016-08-31 09:34:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1364286    

Description Simone Tiraboschi 2016-07-22 08:20:01 UTC
Description of problem:

The connection between the agent and the broker has a 30-second timeout; when it expires, the connection is dropped and the agent restarts.

MainThread::DEBUG::2016-07-21 09:47:34,506::brokerlink::273::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Sending request: notify time=1469108854.51 type=state_transition detail=StartState-ReinitializeFSM hostname='poseidon.netsec'
MainThread::DEBUG::2016-07-21 09:47:34,506::util::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) socket_readline with 30.0 seconds timeout
MainThread::DEBUG::2016-07-21 09:48:04,545::util::88::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) Connection timeout while reading from socket
MainThread::ERROR::2016-07-21 09:48:04,545::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::DEBUG::2016-07-21 09:48:04,546::brokerlink::86::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(disconnect) Closing connection to ha-broker
MainThread::ERROR::2016-07-21 09:48:04,547::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'poseidon.netsec'}: Connection timed out' - trying to restart agent

In some circumstances the broker has to send a notification email, and it does so synchronously. If connecting to the SMTP server takes more than 30 seconds, the agent restarts without a clear indication of the root cause (the log shows only a timeout).

The broker currently opens the SMTP connection without any timeout:
server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])

If this happens 10 times in a row (not that uncommon when the SMTP server is misconfigured or not working), the agent disables itself.

Version-Release number of selected component (if applicable):


How reproducible:
Only when connecting to the SMTP server takes more than 30 seconds.

Steps to Reproduce:
1. Deploy hosted-engine.
2. Make the SMTP server unreachable in a way that takes a long time to detect.

Actual results:
The connection between the agent and the broker is dropped because of the timeout, which causes the agent to restart. If this happens more than 10 times in a row, the agent disables itself.

Expected results:
The agent should keep working even if the broker takes too long to send a notification email, or at least it should fail with a clear error about sending the notification email.
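
One way to meet the first expectation is to take the email off the request path, so the broker can answer the agent before the SMTP exchange finishes. The following is only a minimal sketch of that idea, not the broker's actual code: NotificationWorker, submit, wait_idle and max_pending are hypothetical names, and it is written for Python 3 (on Python 2.7 the queue module is called Queue).

```python
import logging
import queue
import threading

log = logging.getLogger("notification_worker_sketch")


class NotificationWorker(object):
    """Hypothetical helper: runs a possibly slow send function in a
    background thread so the caller never waits for the SMTP server."""

    def __init__(self, send_func, max_pending=100):
        self._send = send_func
        # Bounded queue: a dead SMTP server cannot grow memory forever.
        self._queue = queue.Queue(maxsize=max_pending)
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True
        self._thread.start()

    def submit(self, message):
        """Queue a message without blocking.

        Returns False (and drops the message) if the queue is full.
        """
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            log.error("Notification queue full, dropping message")
            return False

    def wait_idle(self):
        """Block until every queued message has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            message = self._queue.get()
            try:
                self._send(message)
            except Exception:
                # Report the real cause instead of a bare agent timeout.
                log.exception("Failed to send notification email")
            finally:
                self._queue.task_done()
```

With this shape the request handler would call submit() and reply to the agent immediately; a failure to send then shows up in the broker log as an email error rather than as an agent-side connection timeout.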

Additional info:
We have to either make the SMTP connection asynchronous or, more simply, add a timeout value (shorter than the agent-broker connection timeout) to:
server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])
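
The simpler option could look roughly like the sketch below. It is not the actual patch: the "smtp-timeout" key, the 10-second default, the send_email signature and the smtp_factory parameter (there only so the function can be exercised without a real server) are assumptions made for illustration. The only library facts relied on are that smtplib.SMTP accepts a timeout argument and that socket.timeout is a subclass of socket.error.

```python
import logging
import smtplib
import socket

log = logging.getLogger("notifications_sketch")

# Assumed default; it must stay below the 30 s agent-broker timeout.
DEFAULT_SMTP_TIMEOUT = 10.0


def send_email(cfg, sender, recipients, message, smtp_factory=smtplib.SMTP):
    """Send one notification; return True on success, False on failure.

    A slow or unreachable SMTP server produces a logged error that names
    the server, instead of blocking the broker indefinitely.
    """
    timeout = float(cfg.get("smtp-timeout", DEFAULT_SMTP_TIMEOUT))
    try:
        # The timeout covers the connect and later blocking socket calls.
        server = smtp_factory(cfg["smtp-server"],
                              port=int(cfg["smtp-port"]),
                              timeout=timeout)
    except (socket.error, smtplib.SMTPException) as err:
        log.error("Cannot connect to SMTP server %s:%s (timeout %.1f s): %s",
                  cfg["smtp-server"], cfg["smtp-port"], timeout, err)
        return False

    try:
        server.sendmail(sender, recipients, message)
        server.quit()
    except (socket.error, smtplib.SMTPException) as err:
        log.error("Failed to send notification email via %s: %s",
                  cfg["smtp-server"], err)
        return False
    finally:
        # close() is safe to call even after a successful quit().
        server.close()
    return True
```

Choosing a value well under 30 seconds leaves the broker time to reply to the agent after the SMTP attempt has failed.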

Comment 2 Sandro Bonazzola 2016-08-25 14:23:16 UTC
Can you please fill doc-text?

Comment 3 Nikolai Sednev 2016-08-30 14:04:47 UTC
ha-agent does not die, although I do see this error in broker.log:
Thread-3::ERROR::2016-08-30 16:57:32,259::notifications::39::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 26, in send_email
    timeout=float(cfg["smtp-timeout"]))
  File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
timeout: timed out

Works for me with these components on the host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.21.x86_64
sanlock-3.2.4-3.el7_2.x86_64
rhevm-appliance-20160731.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.5-1.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
rhev-release-3.6.9-1-001.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
Linux version 3.10.0-327.36.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Wed Aug 17 03:02:37 EDT 2016
Linux 3.10.0-327.36.1.el7.x86_64 #1 SMP Wed Aug 17 03:02:37 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
ovirt-engine-dwh-setup-4.0.2-1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-setup-0.3.0-0.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.3-0.1.el7ev.noarch
ovirt-engine-restapi-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-cli-3.6.8.1-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-0.3.0-0.el7ev.noarch
ovirt-engine-tools-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.0.3-0.1.el7ev.noarch
python-ovirt-engine-sdk4-4.0.0-0.5.a5.el7ev.x86_64
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-engine-dashboard-1.0.3-1.el7ev.x86_64
ovirt-engine-userportal-4.0.3-0.1.el7ev.noarch
ovirt-engine-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.1-1.el7ev.noarch
ovirt-engine-lib-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-4.0.3-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.3-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.2-1.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.3-0.1.el7ev.noarch
ovirt-engine-backend-4.0.3-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-5.el7ev.noarch
rhevm-4.0.3-0.1.el7ev.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.2-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)