Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1410501

Summary: If an engine API call got stuck, ovirt-hosted-engine-setup will wait forever
Product: [oVirt] ovirt-hosted-engine-setup
Reporter: Simone Tiraboschi <stirabos>
Component: Plugins.General
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: medium
Docs Contact:
Priority: urgent
Version: 2.0.0
CC: bugs, nsednev, stirabos
Target Milestone: ovirt-4.1.0-rc
Keywords: Triaged
Target Release: 2.1.0
Flags: rule-engine: ovirt-4.1+
Hardware: Unspecified
OS: Unspecified
Whiteboard: integration
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The API call can get lost due to the restart of the firewall by host-deploy: add a timeout and eventually retry
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-01 14:38:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Simone Tiraboschi 2017-01-05 15:47:38 UTC
Description of problem:
The default timeout in the oVirt Python SDK is infinite, so if an API call gets stuck for any reason it will simply wait forever, and the application, ovirt-hosted-engine-setup, will hang with it.
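One blunt safeguard for any client built on blocking sockets is a process-wide default timeout. This is a minimal stdlib sketch of the idea, not the SDK's own mechanism, and the 30-second value is arbitrary:

```python
import socket

# Every socket created after this call inherits the default timeout,
# so a library whose sockets would otherwise block indefinitely
# raises socket.timeout instead. Illustrative only; 30 seconds is
# an arbitrary choice, not what the actual fix uses.
socket.setdefaulttimeout(30)

# Sockets created from now on are bounded:
probe = socket.socket()
print(probe.gettimeout())  # 30.0
probe.close()
```

Where the SDK exposes its own timeout option on the connection object, that is the cleaner, more targeted fix than a process-wide default.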

We saw it just once, in a Lago environment.

ovirt-hosted-engine-setup asked the engine to add the host via the REST API and then started polling the REST API with the oVirt Python SDK, waiting for the host to come up from the engine's point of view before continuing.

hosts.add triggered host-deploy, which reconfigured and restarted iptables on the host while ovirt-hosted-engine-setup was still polling the REST API.

We found ovirt-hosted-engine-setup stuck at:
2017-01-05 06:22:44 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:96 VDSM host in installing state
2017-01-05 06:22:45 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:96 VDSM host in installing state
2017-01-05 06:22:46 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:96 VDSM host in installing state

And after two hours it was still there.

Checking the iptables status, we can see that it was restarted by host-deploy exactly at:
gen 05 06:22:47 lago-he-basic-suite-3-6-host0 systemd[1]: Starting IPv4 firewall with iptables...
gen 05 06:22:47 lago-he-basic-suite-3-6-host0 iptables.init[15013]: iptables: Applying firewall rules: [  OK  ]
gen 05 06:22:47 lago-he-basic-suite-3-6-host0 systemd[1]: Started IPv4 firewall with iptables.

And in the iptables configuration we have:
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED

In the end, on the host we can see:
ESTAB       0      0                                                               192.168.202.3:60572                                                                        192.168.202.99:https 

but there is no sign of the counterpart connection on the engine VM, so, with no timeout at all, the client got stuck forever.
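The effect of such a half-open connection can be reproduced in miniature with plain stdlib sockets: a server that accepts and then stays silent leaves a client with no timeout blocked forever, while a bounded client turns the hang into a catchable error. Addresses, ports, and function names here are illustrative, not the actual Lago setup:

```python
import socket
import threading

def start_silent_server():
    """Accept one connection and then stay silent, mimicking the
    engine side once the session state has been lost."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))  # any free port
    srv.listen(1)
    port = srv.getsockname()[1]

    def run():
        conn, _ = srv.accept()
        threading.Event().wait(2)  # hold the connection, never send a byte
        conn.close()
        srv.close()

    threading.Thread(target=run, daemon=True).start()
    return port

def read_with_timeout(port, timeout):
    """A bounded read: returns 'timed out' instead of blocking forever."""
    cli = socket.create_connection(("127.0.0.1", port), timeout=timeout)
    try:
        cli.recv(1)          # with timeout=None this would hang indefinitely
        return "data"
    except socket.timeout:
        return "timed out"
    finally:
        cli.close()
```

With `timeout=None` (the SDK's old default), `recv` in this sketch would block for as long as the silent peer holds the connection, which is exactly the observed hang.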


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup.noarch     1.3.7.4-0.0.master.20160823094509.git7add02e.el7.centos

How reproducible:
Very difficult: it's a race condition between opening the TCP connection and restarting iptables.
AFAIK the only case where it could happen is that the SYN and SYN-ACK were correctly delivered but the ACK packet got lost, so the client (the oVirt Python SDK, on the host) thinks the connection is ESTABLISHED while the server (httpd on the engine VM) does not.

Steps to Reproduce:
1. deploy hosted-engine

Actual results:
ovirt-hosted-engine-setup is stuck forever on an API call; the last line in the log is "VDSM host in installing state".

Expected results:
The timeout triggers and ovirt-hosted-engine-setup retries the poll.
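The expected behaviour can be sketched as a polling loop whose API call is bounded and whose timeouts are simply retried. `fetch_state` is a hypothetical stand-in for the SDK call, not the actual ovirt-hosted-engine-setup code:

```python
import time

def poll_host_up(fetch_state, interval=1.0, max_attempts=None):
    """Poll fetch_state() until it reports 'up'.

    fetch_state is assumed to enforce its own per-call timeout and to
    raise TimeoutError when the API call stalls; instead of hanging
    forever, the loop just retries the poll."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            state = fetch_state()
        except TimeoutError:
            continue  # the call got lost (e.g. firewall restart): retry
        if state == "up":
            return attempt
        time.sleep(interval)  # host still installing; poll again
    raise RuntimeError("host never reached the 'up' state")
```

The key point is the `except TimeoutError: continue` branch: a lost call costs one bounded attempt instead of an infinite wait.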

Additional info:

Comment 1 Simone Tiraboschi 2017-01-09 11:15:59 UTC
*** Bug 1406486 has been marked as a duplicate of this bug. ***

Comment 6 Simone Tiraboschi 2017-01-24 10:36:16 UTC
I see Doc Type: Bug Fix, no?

Comment 7 Nikolai Sednev 2017-01-25 19:10:30 UTC
Works for me on these components on host:
rhvm-appliance-4.1.20170119.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-2.el7ev.noarch
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-doc-4.1.0-1.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.0.1-0.1.el7.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-branding-rhev-4.1.0-0.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)