Bug 1698814

Summary: Scale up nodes failed at task [Apply ignition manifest]
Product: OpenShift Container Platform
Reporter: Weihua Meng <wmeng>
Component: Installer
Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible
QA Contact: Weihua Meng <wmeng>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: chaoyang, gpei, lxia, vrutkovs, wehe
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:22 UTC
Type: Bug
Attachments:
Description Flags
ec2-console-log none

Description Weihua Meng 2019-04-11 09:05:39 UTC
Description of problem:
Scale up nodes failed at task [Apply ignition manifest]

Version-Release number of the following components:
openshift-ansible-4.1.0-201904091404

ansible 2.7.9
  config file = /home/wmeng/openshift/openshift-ansible/ansible.cfg
  configured module search path = ['/home/wmeng/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.6 (default, Mar 29 2019, 00:03:27) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/config.yml#L78

How reproducible:
Always

Steps to Reproduce:
Following https://github.com/openshift/openshift-ansible/blob/master/README.md:
1. Set up an OCP4 cluster.
2. Prepare the new_worker VMs.
3. Run the scaleup playbook:
$ ansible-playbook -i inventory/hosts playbooks/scaleup.yml

Actual results:
The playbook failed because the new worker hosts became UNREACHABLE.

<ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/wmeng/shared-secrets/aws/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=ec2-user -o ConnectTimeout=30 -o ControlPath=/home/wmeng/.ansible/cp/%h-%r -tt ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pwwpajzkopvjdnttsmmyispqpjrskwdv; /usr/bin/python /home/ec2-user/.ansible/tmp/ansible-tmp-1554971512.1921854-243809547831033/async_wrapper.py 339131747332 900 /home/ec2-user/.ansible/tmp/ansible-tmp-1554971512.1921854-243809547831033/AnsiballZ_command.py _'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com> (255, b'{"started": 1, "_ansible_suppress_tmpdir_delete": true, "finished": 0, "results_file": "/root/.ansible_async/339131747332.4000", "ansible_job_id": "339131747332.4000"}\r\n', b'Shared connection to ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com closed.\r\n')
fatal: [ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com]: UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: Shared connection to ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com closed.",
    "unreachable": true
}

PLAY RECAP ********************************************************************************************************************************************************************************************************
ec2-18-182-49-196.ap-northeast-1.compute.amazonaws.com : ok=18   changed=7    unreachable=1    failed=0   
ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com : ok=18   changed=7    unreachable=1    failed=0   
ec2-54-95-137-65.ap-northeast-1.compute.amazonaws.com : ok=18   changed=7    unreachable=1    failed=0   
localhost                  : ok=0    changed=0    unreachable=0    failed=0   

Thursday 11 April 2019  04:39:00 -0400 (0:07:08.795)       0:08:59.703 ******** 
=============================================================================== 
openshift_node : Apply ignition manifest ----------------------------------------------------------------------------------------------------------------------------------------------------------------- 428.80s
/home/wmeng/openshift/openshift-ansible/roles/openshift_node/tasks/config.yml:78 ---------------------------------------------------------------------------------------------------------------------------------
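
When triaging a run like the one above, it can help to pull the unreachable hosts out of the PLAY RECAP mechanically. The following is a minimal sketch in Python, assuming the recap lines keep the `host : key=value ...` shape shown in this log. The names `parse_recap`, `unreachable_hosts` and `SAMPLE_RECAP` are hypothetical helpers for illustration and are not part of openshift-ansible or Ansible.

```python
import re

# Matches one PLAY RECAP host line, for example:
# "host.example.com : ok=18   changed=7    unreachable=1    failed=0"
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Return {host: {counter_name: int}} for every recap line found in text."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (pair.split("=") for pair in match.group("counters").split())
            results[match.group("host")] = {name: int(value) for name, value in pairs}
    return results


def unreachable_hosts(text):
    """List hosts whose recap shows at least one unreachable task."""
    return [
        host
        for host, counters in parse_recap(text).items()
        if counters.get("unreachable", 0) > 0
    ]


# Two recap lines copied from the log above (illustrative input only).
SAMPLE_RECAP = """\
ec2-18-182-49-196.ap-northeast-1.compute.amazonaws.com : ok=18   changed=7    unreachable=1    failed=0
localhost                  : ok=0    changed=0    unreachable=0    failed=0
"""

if __name__ == "__main__":
    print(unreachable_hosts(SAMPLE_RECAP))
```

Lines that are not host summaries, such as the `PLAY RECAP ***` banner or the timing line, do not match the pattern and are skipped.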


Expected results:
The Ansible playbook completes successfully.

Comment 1 Russell Teague 2019-04-11 12:35:09 UTC
Please attach full ansible -vvv logs.

Please ensure that openshift-ansible/ansible.cfg is being used during playbook execution. The failed task could be related to SSH retries not being configured.

https://github.com/openshift/openshift-ansible/blob/master/ansible.cfg#L38
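
One way to confirm the retry setting is to read the config file reported on the "config file =" line of the `ansible --version` output and look for `retries` in the `[ssh_connection]` section. The sketch below does that with Python's standard `configparser`. The function name `ssh_retries` and the contents of `EXAMPLE_CFG` are assumptions made for illustration; they are not copied from the repository's ansible.cfg.

```python
import configparser


def ssh_retries(cfg_text):
    """Return the [ssh_connection] retries value from ansible.cfg text, or None if unset."""
    # Interpolation is disabled so that values containing '%' (for example
    # SSH control path templates) are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if parser.has_option("ssh_connection", "retries"):
        return parser.getint("ssh_connection", "retries")
    return None


# Hypothetical ansible.cfg contents, for illustration only.
EXAMPLE_CFG = """\
[defaults]
forks = 20

[ssh_connection]
retries = 15
pipelining = True
"""

if __name__ == "__main__":
    print(ssh_retries(EXAMPLE_CFG))
```

A result of `None` would suggest that the file in use does not set SSH retries, which is the situation described above.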

Comment 3 Russell Teague 2019-04-12 12:20:48 UTC
Logs show task [Apply ignition manifest] was tried 15 times and failed after 7 minutes.

Comment 4 Vadim Rutkovsky 2019-04-12 12:31:30 UTC
It appears the node was rebooted but did not come back up afterwards. There is not much Ansible can do about this.

Is this reproducible? Is there a console log from the node from when the task failed?
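
To check independently of Ansible whether a rebooted node ever comes back, one option is to poll its SSH port from the control host. This is a rough sketch only: `wait_for_port` is a hypothetical helper, the default timeout and interval are arbitrary, and an open port shows that something is listening, not that the node finished booting correctly.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=600.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or timeout seconds elapse.

    Returns True if the port accepted a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Hostname taken from the log in this report; replace with the node under test.
    node = "ec2-54-199-196-17.ap-northeast-1.compute.amazonaws.com"
    print(wait_for_port(node, timeout=30.0, interval=5.0))
```

If this returns False well after the reboot, the console log from the instance is the next place to look.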

Comment 8 Weihua Meng 2019-04-16 08:03:37 UTC
Created attachment 1555403 [details]
ec2-console-log

Comment 9 Weihua Meng 2019-04-16 08:04:59 UTC
(In reply to Vadim Rutkovsky from comment #4)
> It appears the node was rebooted but did not come back up afterwards. There
> is not much Ansible can do about this.
> 
> Is this reproducible? Is there a console log from the node from when the
> task failed?

It can be reproduced.

Console log attached.

Comment 10 Russell Teague 2019-04-17 12:36:27 UTC
Changes to the way reboots are handled while applying the ignition manifest have merged.
openshift-ansible-4.1.0-201904161832

Please test again.

Comment 12 Weihua Meng 2019-04-22 00:39:21 UTC
Fixed.

openshift-ansible-4.1.0-201904201251.git.148.6de1227.el7.noarch

The scaleup playbook now finishes successfully.

PLAY RECAP ********************************************************************************************************************************************************************************************************
ec2-52-194-221-152.ap-northeast-1.compute.amazonaws.com : ok=20   changed=14   unreachable=0    failed=0   
localhost                  : ok=0    changed=0    unreachable=0    failed=0   

Sunday 21 April 2019  20:33:46 -0400 (0:00:40.746)       0:05:21.560 **********

Comment 14 errata-xmlrpc 2019-06-04 10:47:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758