Bug 1410501 - If an engine API call gets stuck, ovirt-hosted-engine-setup will wait forever
Summary: If an engine API call gets stuck, ovirt-hosted-engine-setup will wait forever
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: Plugins.General
Version: 2.0.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ovirt-4.1.0-rc
Target Release: 2.1.0
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard: integration
Duplicates: 1406486
Depends On:
Blocks:
 
Reported: 2017-01-05 15:47 UTC by Simone Tiraboschi
Modified: 2017-05-11 09:25 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The API call can get lost due to the restart of the firewall by host-deploy: add a timeout and retry if needed
Clone Of:
Environment:
Last Closed: 2017-02-01 14:38:46 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.1+




Links
System        ID     Branch                         Status  Summary                                     Last Updated
oVirt gerrit  69724  master                         MERGED  API: add a default timeout for engine API  2017-01-09 15:01:51 UTC
oVirt gerrit  69860  ovirt-hosted-engine-setup-2.1  MERGED  API: add a default timeout for engine API  2017-01-10 11:23:21 UTC

Description Simone Tiraboschi 2017-01-05 15:47:38 UTC
Description of problem:
The default timeout in the oVirt Python SDK is infinite, so if for any reason an API call gets stuck, it will simply wait forever, and the application, ovirt-hosted-engine-setup, will also get stuck forever.
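
For illustration, a minimal sketch of the bounded call, assuming the v3 SDK's timeout parameter; the connection details are placeholders and the real wiring in ovirt-hosted-engine-setup differs:

from ovirtsdk.api import API

# Placeholder connection values; the real ones come from the setup
# environment. The key change is a finite per-request timeout so that
# no single HTTP call can block forever.
engine_api = API(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='example-password',
    ca_file='/etc/pki/ovirt-engine/ca.pem',
    timeout=30,
)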

We saw it just once, in a Lago environment.

ovirt-hosted-engine-setup asked the engine to add the host via the REST API and then started polling the REST API with the oVirt Python SDK, waiting for the host to come up in the engine's eyes before continuing.

hosts.add triggered host-deploy, which reconfigured and restarted iptables on the host while ovirt-hosted-engine-setup was polling the REST API.

We found ovirt-hosted-engine-setup stuck at:
2017-01-05 06:22:44 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:96 VDSM host in installing state
2017-01-05 06:22:45 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:96 VDSM host in installing state
2017-01-05 06:22:46 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:96 VDSM host in installing state

And after two hours it was still there.

Checking the iptables status, we can see that it got restarted by host-deploy exactly at:
gen 05 06:22:47 lago-he-basic-suite-3-6-host0 systemd[1]: Starting IPv4 firewall with iptables...
gen 05 06:22:47 lago-he-basic-suite-3-6-host0 iptables.init[15013]: iptables: Applying firewall rules: [  OK  ]
gen 05 06:22:47 lago-he-basic-suite-3-6-host0 systemd[1]: Started IPv4 firewall with iptables.

And in the iptables configuration we have:
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED

In the end, on the host we can see:
ESTAB       0      0                                                               192.168.202.3:60572                                                                        192.168.202.99:https 

but we have no sign of the counterpart connection on the engine VM, so, having no timeout at all, it got stuck forever.


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup.noarch     1.3.7.4-0.0.master.20160823094509.git7add02e.el7.centos

How reproducible:
Really, really difficult: it's a race condition between opening the TCP connection and restarting iptables.
AFAIK the only case where it could happen is when the SYN and SYN-ACK got correctly delivered but the ACK packet got lost, so the client (the oVirt Python SDK on the host) thinks that the connection is ESTABLISHED while the server (httpd on the engine VM) does not.
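
The hang itself is plain TCP and independent of the SDK: once the connection is half-open, a read on the client socket blocks indefinitely unless a read timeout is set. A minimal, hypothetical illustration (the address stands in for the engine's https endpoint):

import socket

# Placeholder address standing in for the engine's https endpoint.
sock = socket.create_connection(('192.168.202.99', 443), timeout=10)

# Without a read timeout, recv() on a half-open connection can block
# forever, which is exactly the hang observed above.
sock.settimeout(30)
try:
    sock.sendall(b'...request bytes...')
    reply = sock.recv(4096)
except socket.timeout:
    # The connection is effectively dead: drop it and retry from scratch.
    sock.close()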

Steps to Reproduce:
1. deploy hosted-engine

Actual results:
ovirt-hosted-engine-setup is stuck forever on an API call; the last line in the log is 'VDSM host in installing state'

Expected results:
The timeout will trigger and ovirt-hosted-engine-setup will try polling again

Additional info:
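A hedged sketch of the retry pattern, using the v3 SDK; the function name and exception handling are illustrative, not the actual plugin code:

import time
from ovirtsdk.infrastructure.errors import ConnectionError, RequestError

def wait_host_ready(engine_api, host_name, poll_interval=5):
    # Poll the engine until the host reaches the 'up' state. With a
    # finite per-request timeout on engine_api, a lost call surfaces
    # as an exception we can recover from instead of an endless wait.
    while True:
        try:
            host = engine_api.hosts.get(name=host_name)
            if host.get_status().get_state() == 'up':
                return
        except (ConnectionError, RequestError):
            pass  # the request timed out or failed: poll again
        time.sleep(poll_interval)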

Comment 1 Simone Tiraboschi 2017-01-09 11:15:59 UTC
*** Bug 1406486 has been marked as a duplicate of this bug. ***

Comment 6 Simone Tiraboschi 2017-01-24 10:36:16 UTC
I see Doc Type: Bug Fix, no?

Comment 7 Nikolai Sednev 2017-01-25 19:10:30 UTC
Works for me on these components on host:
rhvm-appliance-4.1.20170119.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-2.el7ev.noarch
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-doc-4.1.0-1.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.0.1-0.1.el7.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-branding-rhev-4.1.0-0.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

