Bug 1528960

Summary: Add ability to change maximum timeout for Ansible process executed from engine to finish
Product: [oVirt] ovirt-engine
Reporter: Nelly Credi <ncredi>
Component: Host-Deploy
Assignee: Ondra Machacek <omachace>
Status: CLOSED CURRENTRELEASE
QA Contact: Pavol Brilla <pbrilla>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.2.0
CC: bugs, eedri, lsvaty, lveyde, mperina, ncredi, omachace, rhv-bugzilla-bot
Target Milestone: ovirt-4.2.1
Keywords: Automation
Target Release: ---
Flags: rule-engine: ovirt-4.2+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.2.1.1
Doc Type: Enhancement
Doc Text:
The default timeout for an Ansible process executed from the engine has been increased to 30 minutes, because some operations, host upgrades in particular, can take a significant amount of time. If the Ansible process does not finish within this timeout, the engine kills the process and fails the action. If the default 30 minutes is still not enough, administrators can increase it further by creating a new configuration file in /etc/ovirt-engine/engine.conf.d (for example 99-ansible-playbook-timeout.conf) with the following content: ANSIBLE_PLAYBOOK_EXEC_DEFAULT_TIMEOUT=NNN where NNN is the number of minutes the engine should wait for the Ansible process to finish.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-12 11:50:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description     Flags
playbook log    none
engine log      none

Description Nelly Credi 2017-12-25 11:46:44 UTC
Created attachment 1372137 [details]
playbook log

Description of problem:
when upgrading the hosts from 4.1 to 4.2, the host gets into 'Install Failed' status

Version-Release number of selected component (if applicable):
on the host:
vdsm-4.20.9.3-1.el7ev.x86_64
ovirt-host-4.2.0-1.el7ev.x86_64

on the engine:
ovirt-engine-4.2.0.2-0.1.el7.noarch
ovirt-host-deploy-1.7.0-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. upgrade engine from 4.1 to 4.2
2. set repos on the hosts
3. upgrade via rest or webui

Actual results:
host is in 'Install Failed' status

Expected results:
host should get updated

Additional info:
looks like an ansible process is defunct:
[root@jenkins-vm-16 host-deploy]# ps -ef | grep ansible
ovirt    26764 25035  0 12:13 ?        00:00:07 /usr/bin/python2 /usr/bin/ansible-playbook -v --private-key=/etc/pki/ovirt-engine/keys/engine_id_rsa --inventory=/tmp/ansible-inventory1951036924750427419 /usr/share/ovirt-engine/playbooks/ovirt-host-upgrade.yml
ovirt    26775 26764  0 12:13 ?        00:00:00 [ansible-playboo] <defunct>
ovirt    29615 25035  1 13:33 ?        00:00:08 /usr/bin/python2 /usr/bin/ansible-playbook -v --private-key=/etc/pki/ovirt-engine/keys/engine_id_rsa --inventory=/tmp/ansible-inventory7442594928060297520 /usr/share/ovirt-engine/playbooks/ovirt-host-upgrade.yml

Comment 1 Nelly Credi 2017-12-25 11:47:25 UTC
Created attachment 1372138 [details]
engine log

Comment 2 Nelly Credi 2017-12-25 11:55:03 UTC
it is always failing on the first upgrade try, 
but when re-triggering the process it succeeds

Comment 3 Yaniv Kaul 2017-12-26 07:41:51 UTC
What was the source 4.1? Did it have ovirt-host installed?

Comment 4 Nelly Credi 2017-12-26 11:39:49 UTC
build 4.1.8-5

before upgrade:
ovirt-engine-4.1.8.2-0.1.el7.noarch
ovirt-host-deploy-1.6.7-1.el7ev.noarch

ovirt-host is installed only in 4.2 afaik

engine & host after upgrade:
ovirt-host-4.2.0-1.el7ev.x86_64
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
ovirt-host-deploy-1.7.0-1.el7ev.noarch

Comment 8 Martin Perina 2017-12-26 13:24:43 UTC
Could you please also provide the host-deploy log? You have provided only the Ansible part of the host-deploy log.

Comment 9 Nelly Credi 2017-12-29 10:04:01 UTC
I'm afraid I don't have them,
and I have been unable to reproduce the issue since.

Comment 11 Martin Perina 2018-01-08 12:55:54 UTC
We have increased the default timeout to 30 minutes, and users are now able to raise that timeout even further.
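
For illustration, the override described in the Doc Text could be written roughly as follows. This is a minimal sketch and not part of the fix: write_timeout_conf is a hypothetical helper, the value 60 is only an example, and the service restart in the usage comment is an assumption about how the engine picks up the new setting. The directory, file name and variable name are the ones given in the Doc Text.

```shell
#!/bin/sh
# Sketch: write an engine.conf.d drop-in that overrides the Ansible timeout.
# write_timeout_conf is a hypothetical helper, not shipped with ovirt-engine.
write_timeout_conf() {
    conf_dir=$1   # e.g. /etc/ovirt-engine/engine.conf.d
    minutes=$2    # timeout in minutes (NNN in the Doc Text)

    # Reject anything that is not a positive integer.
    case $minutes in
        ''|*[!0-9]*|0)
            echo "timeout must be a positive integer" >&2
            return 1
            ;;
    esac

    # The file name follows the example given in the Doc Text.
    printf 'ANSIBLE_PLAYBOOK_EXEC_DEFAULT_TIMEOUT=%s\n' "$minutes" \
        > "$conf_dir/99-ansible-playbook-timeout.conf"
}

# Usage on the engine machine, as root (60 is an illustrative value):
#   write_timeout_conf /etc/ovirt-engine/engine.conf.d 60
#   systemctl restart ovirt-engine   # assumption: restart so the engine re-reads its config
```

Keeping the setting in a separate drop-in file, rather than editing the packaged ovirt-engine.conf, should let the override survive package updates.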

Comment 12 Pavol Brilla 2018-01-17 08:38:05 UTC
# grep ANSIBLE /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf
# yum list ovirt-engine
ovirt-engine.noarch    4.2.0.2-0.1.el7           @rhv-4.2.x

-----

# yum list ovirt-engine
ovirt-engine.noarch            4.2.1.1-0.1.el7          @rhv-4.2x
# grep ANSIBLE /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf
ANSIBLE_PLAYBOOK_EXEC_DEFAULT_TIMEOUT=30

Comment 13 Sandro Bonazzola 2018-02-12 11:50:59 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.