Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1586197

Summary: Installer fails - node service does not start in time on one of the masters
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Cluster Version Operator
Assignee: Russell Teague <rteague>
Status: CLOSED CURRENTRELEASE
QA Contact: Vikas Laad <vlaad>
Severity: high
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, jokerman, mifiedle, mmccomas, wmeng, wsun
Target Milestone: ---
Keywords: TestBlocker, Unconfirmed
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
The Ansible async timeout was very short, which would intermittently cause the later async status check to fail because the original task job never reported completion. The async job timeout was increased to ensure there was enough time for the job either to complete successfully or to fail with an appropriate error message.
Story Points: ---
Last Closed: 2018-12-20 21:36:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Attachments:
ansible log with -vvv (flags: none)
ansible log with -vvv (flags: none)
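The Doc Text above refers to Ansible's async/poll pattern: a long-running task is launched with `async`, and a later task polls `async_status` until the job finishes. A minimal sketch of that pattern (the task names, service name, and timeout values here are illustrative, not the actual openshift-ansible code):

```yaml
# Illustrative only; not the actual openshift-ansible tasks.
- name: Restart node service
  service:
    name: atomic-openshift-node
    state: restarted
  async: 300        # seconds the background job may run before Ansible abandons it
  poll: 0           # fire and forget; a later task checks the result
  register: node_restart

- name: Check status of node service
  async_status:
    jid: "{{ node_restart.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30
  delay: 10
```

If the `async` value is shorter than the job actually needs, the background job is abandoned before it can record a result, and the later `async_status` check fails "without returning a message", matching the error in this report; raising the timeout is the fix described in the Doc Text.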

Description Vikas Laad 2018-06-05 17:03:00 UTC
Description of problem:
I am trying to upgrade an HA cluster with the following nodes; I have tried it a couple of times. While upgrading one of the masters, the installer fails complaining the node service can't be started, but after some time the node service starts. If I re-run the installer after that, it moves on to the next task.

1 lb
3 masters
3 etcd
2 infra
2 compute

Version-Release number of the following components:
rpm -q openshift-ansible - latest 340e2f3e86d1119541c300d95b4e7c877b0a6b99
rpm -q ansible 
ansible-2.4.3.0-1.el7ae.noarch

ansible --version
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Apr 19 2018, 05:40:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
Reproducible with an HA cluster; a single-master cluster works fine.

Steps to Reproduce:
1. create HA 3.9 cluster
2. upgrade cluster to 3.10

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated
  Hosts:    ec2-54-200-4-41.us-west-2.compute.amazonaws.com
  Play:     Update master nodes
  Task:     Check status of node service
  Message:  Failed without returning a message.

Expected results:
Installer should complete.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 2 Vikas Laad 2018-06-05 17:06:06 UTC
Created attachment 1447950 [details]
ansible log with -vvv

Comment 6 Russell Teague 2018-06-08 14:27:32 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/8691

Comment 7 openshift-github-bot 2018-06-08 19:43:34 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/f1ee19b52f49941ac4cf56c41770d5aa3e86f761
Bug 1586197 - Increase async timeout

https://github.com/openshift/openshift-ansible/commit/bfa27c9beaa14483134dd8af5e1492716a591cbe
Merge pull request #8691 from mtnbikenc/fix-1586197

Bug 1586197 - Increase async timeout

Comment 8 Russell Teague 2018-06-11 21:02:30 UTC
openshift-ansible-3.10.0-0.66.0

Comment 9 Vikas Laad 2018-06-12 15:27:57 UTC
Created attachment 1450530 [details]
ansible log with -vvv

Tried again with the latest code from openshift-master, here is the head 79d6516f4164b82c7dbfdc120f8f4f229116abc1

Saw the same failure; please see the latest ansible log attached.

Comment 10 Russell Teague 2018-06-12 20:31:19 UTC
*** Bug 1589531 has been marked as a duplicate of this bug. ***

Comment 11 Russell Teague 2018-06-12 20:35:48 UTC
I've attempted several times to reproduce this but have been unsuccessful.  If this is reproduced again, please provide the contents of the ansible_sync job id file located in /root/.ansible_sync/.  The job id corresponds to the id in the task output.
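The job id files requested above are small JSON status records written by Ansible's async wrapper. A hypothetical sketch of reading one (the directory, job id, and sample fields below are placeholders for illustration, not taken from this bug's logs):

```python
import json
from pathlib import Path

def read_async_status(status_dir: Path, job_id: str) -> dict:
    """Return the JSON status record Ansible's async wrapper wrote for a job.

    The file name is the async job id shown in the failing task's output;
    status_dir is the directory named in the comment above.
    """
    return json.loads((status_dir / job_id).read_text())
```

Attaching the parsed contents of such a file would show whether the original job ever started or finished, which is what the comment above is asking for.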

Comment 12 Wei Sun 2018-06-13 02:33:40 UTC
Added the TestBlocker keyword since the duplicate bug is blocking the upgrade testing against HA clusters per https://bugzilla.redhat.com/show_bug.cgi?id=1589531#c1

Comment 13 Vikas Laad 2018-06-13 18:28:16 UTC
I completed 2 upgrades on an HA cluster; both completed fine.

I verified with latest git hash a1634c352a0ebc4476c9d961a74f2c3817ad35e8 from openshift-ansible

Comment 14 Russell Teague 2018-06-13 18:49:57 UTC
This should be fixed in openshift-ansible-3.10.0-0.66.0 or newer.