Bug 1410250

Summary: RHV: Ansible timeout trying to configure engine
Product: Red Hat Quickstart Cloud Installer
Reporter: Chandler Wilkerson <cwilkers>
Component: Installation - RHEV
Assignee: jkim
Status: CLOSED ERRATA
QA Contact: Tasos Papaioannou <tpapaioa>
Severity: medium
Docs Contact: Dan Macpherson <dmacpher>
Priority: unspecified
Version: 1.1
CC: bthurber, jkim, qci-bugzillas, tpapaioa
Target Milestone: ---
Keywords: Triaged
Target Release: 1.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-28 01:43:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Chandler Wilkerson 2017-01-04 22:10:52 UTC
Description of problem:

RHV installation fails at 80% with an Ansible timeout while trying to reach the engine during the Actions::Fusor::Deployment::Rhev::TriggerAnsibleRun step.

The step is resumable, and the installation completes successfully if the step is resumed once the engine is available.

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20161215.t.0 iso

How reproducible:
Twice in my tests, using physical Dell blades

Additional info:

Comment 2 Chandler Wilkerson 2017-01-05 16:27:28 UTC
Confirmed the same bug in the Jan 04 compose as well.

Comment 3 John Matthews 2017-01-06 21:17:40 UTC
There's a chance this issue relates to lazy sync and packages not being available when we need them.

When working on this, also look at BZ 1410140; the retry mentioned there is likely to fix this if the problem is a lazy sync issue.

Comment 4 Chandler Wilkerson 2017-01-06 21:51:53 UTC
The error I see relates to not being able to ssh into the RHV engine host to kick off the Ansible playbook. Is it possible that, with a slower bare-iron boot environment, the timeouts you have in devel are not long enough? (Can these be extended without a negative impact?)

It isn't rare or intermittent in my environment; I get this every time I deploy RHV. (Also confirmed in the 20170105 ISO.)

Comment 5 jkim 2017-01-10 21:07:28 UTC
https://github.com/fusor/fusor/pull/1329

Added a retry when calling trigger_ansible_run().
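
For readers unfamiliar with the pattern, a retry around a failing call might look roughly like the sketch below. This is not the code from the PR: the helper name with_retries, its parameters, and the stand-in block are all hypothetical and only illustrate the general idea of re-running a step a bounded number of times.

```ruby
# Hypothetical helper; names and defaults are illustrative, not from the fusor PR.
# Runs the block, and re-runs it on the listed exceptions until max_attempts is reached.
def with_retries(max_attempts: 3, delay: 0, on: [StandardError])
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *on => e
    # Out of attempts: re-raise the last error to the caller.
    raise if attempt >= max_attempts
    warn "Attempt [#{attempt} of #{max_attempts}] failed: #{e.message}. Retrying..."
    sleep delay
    retry
  end
end

# Stand-in for the real call: fails twice, then succeeds.
calls = 0
result = with_retries(max_attempts: 5) do
  calls += 1
  raise IOError, "engine not reachable yet" if calls < 3
  :ok
end
```

A bounded attempt count keeps a genuinely dead engine from stalling the deployment forever, while still tolerating a slow boot.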

Comment 7 Chandler Wilkerson 2017-01-12 03:49:54 UTC
RHV is able to successfully install off the 20170111-7 ISO. Thanks!

Comment 8 jkim 2017-01-13 22:45:25 UTC
https://github.com/fusor/fusor/pull/1343

After recreating the setup, I found that the issue was the distribute_key_to_host method not catching the Errno::ETIMEDOUT exception. The PR was successfully tested on the host that consistently produced the bug.
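
As an illustration only, rescuing Errno::ETIMEDOUT while waiting for a slow host to accept connections could be sketched as follows. The actual distribute_key_to_host change is not reproduced in this report; the function wait_for_ssh, the list of rescued errors, and the injectable connect lambda are assumptions made for the example.

```ruby
require "socket"

# Hypothetical sketch; not the real distribute_key_to_host from fusor.
# Errors treated as "host not up yet" (the list is an assumption).
TRANSIENT_ERRORS = [Errno::ETIMEDOUT, Errno::ECONNREFUSED, Errno::EHOSTUNREACH].freeze

# Tries to open a TCP connection to the host's SSH port, rescuing timeouts
# instead of letting them abort the caller. `connect` is injectable so the
# logic can be exercised without a network.
def wait_for_ssh(host, port: 22, attempts: 30, delay: 10,
                 connect: ->(addr, prt) { Socket.tcp(addr, prt, connect_timeout: 5) { true } })
  attempts.times do |i|
    begin
      return true if connect.call(host, port)
    rescue *TRANSIENT_ERRORS => e
      warn "#{host}:#{port} not ready (#{e.class}), attempt #{i + 1} of #{attempts}"
      sleep delay unless i == attempts - 1
    end
  end
  false
end
```

Without the rescue, the first Errno::ETIMEDOUT would propagate and fail the whole step, which matches the symptom described on slower hardware.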

Comment 11 Tasos Papaioannou 2017-01-25 15:20:32 UTC
Verified on QCI-1.1-RHEL-7-20170123.t.0.

To test, I waited until the RHV engine was running ansible tasks, then rebooted it. The deployment log shows ansible-playbook retrying and completing successfully:

****
E, [2017-01-24T17:44:36.020347 #22988] ERROR -- : Error running command: ansible-playbook /usr/share/ansible-ovirt/engine_and_hypervisor.yml 
[...]
TASK [subscription : disable all] **********************************************
fatal: [mac525400c24eaa.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to mac525400c24eaa.example.com closed.\r\n", "unreachable": true}
[...]
W, [2017-01-24T17:44:36.021442 #22988]  WARN -- : Attempt [1 of 30] of the above command FAILED!... Retrying...
I, [2017-01-24T18:26:47.238696 #22988]  INFO -- : Command: ansible-playbook /usr/share/ansible-ovirt/engine_and_hypervisor.yml 
[...]
I, [2017-01-24T18:26:47.239112 #22988]  INFO -- : Status code: 0
****
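
A check like the one above can also be scripted. The sketch below scans log text for the retry warning and the final status code; the regular expressions are modeled only on the excerpt shown here, and the function name summarize_retries is hypothetical.

```ruby
# Hypothetical log check; patterns are based solely on the excerpt in this comment.
RETRY_RE  = /Attempt \[(\d+) of (\d+)\] of the above command FAILED/
STATUS_RE = /Status code: (\d+)/

# Returns the retry attempts seen and the last reported status code (nil if none).
def summarize_retries(log_text)
  retries = log_text.scan(RETRY_RE).map { |n, max| [n.to_i, max.to_i] }
  last_status = log_text.scan(STATUS_RE).flatten.last
  { retries: retries, final_status: last_status && last_status.to_i }
end

sample = <<~LOG
  W, [2017-01-24T17:44:36.021442 #22988]  WARN -- : Attempt [1 of 30] of the above command FAILED!... Retrying...
  I, [2017-01-24T18:26:47.239112 #22988]  INFO -- : Status code: 0
LOG

summary = summarize_retries(sample)
```

For the two sample lines, summary records one retry (attempt 1 of 30) and a final status of 0.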

Comment 13 errata-xmlrpc 2017-02-28 01:43:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335