Bug 1367995

Summary: Nodes not ready after 50 retries
Product: OpenShift Container Platform
Reporter: Gan Huang <ghuang>
Component: Installer
Assignee: Jason DeTiberus <jdetiber>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Johnny Liu <jialiu>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.3.0
CC: aos-bugs, bleanhar, ghuang, jokerman, mmccomas
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2016-09-08 11:57:02 UTC
Type: Bug

Description Gan Huang 2016-08-18 05:26:06 UTC
Description of problem:
I tried to install an HA environment (2 masters + 6 nodes + 3 etcd). The installation failed at TASK [openshift_manage_node : Wait for Node Registration], but when I checked the status manually afterwards, all nodes had become Ready.


Version-Release number of selected component (if applicable):
openshift-ansible-3.3.12-1.git.0.b26c8c2.el7.noarch.rpm

How reproducible:
30%

Steps to Reproduce:
1. # cat inventory_hosts
<--snip-->

[masters]
ec2-52-207-222-117.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-52-207-222-117.compute-1.amazonaws.com
ec2-54-234-204-42.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-234-204-42.compute-1.amazonaws.com

[nodes]
ec2-52-207-222-117.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-52-207-222-117.compute-1.amazonaws.com openshift_node_labels="{'role': 'node'}" openshift_scheduleable=false
ec2-54-234-204-42.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-234-204-42.compute-1.amazonaws.com openshift_node_labels="{'role': 'node'}" openshift_scheduleable=false

ec2-54-165-53-104.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-165-53-104.compute-1.amazonaws.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}"
ec2-54-172-58-88.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-172-58-88.compute-1.amazonaws.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}"
ec2-54-164-208-60.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-164-208-60.compute-1.amazonaws.com openshift_node_labels="{'role': 'node'}"
ec2-52-90-219-39.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-52-90-219-39.compute-1.amazonaws.com openshift_node_labels="{'role': 'node'}"

[etcd]
ec2-54-234-204-42.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-234-204-42.compute-1.amazonaws.com
ec2-54-165-53-104.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-165-53-104.compute-1.amazonaws.com
ec2-54-172-58-88.compute-1.amazonaws.com ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="/home/slave1/workspace/Launch-Environment-Flexy/private/config/keys/libra.pem" openshift_public_hostname=ec2-54-172-58-88.compute-1.amazonaws.com

2. Trigger the installation

Actual results:
TASK [openshift_manage_node : Wait for Node Registration] **********************
Thursday 18 August 2016  03:34:47 +0000 (0:00:00.072)       0:28:35.787 ******* 
ok: [ec2-52-207-222-117.compute-1.amazonaws.com] => (item=ip-172-18-10-178.ec2.internal) => {"changed": false, "cmd": ["oc", "get", "node", "ip-172-18-10-178.ec2.internal"], "delta": "0:00:00.164148", "end": "2016-08-17 23:34:50.005681", "item": "ip-172-18-10-178.ec2.internal", "rc": 0, "start": "2016-08-17 23:34:49.841533", "stderr": "", "stdout": "NAME                            STATUS    AGE\nip-172-18-10-178.ec2.internal   Ready     59s", "stdout_lines": ["NAME                            STATUS    AGE", "ip-172-18-10-178.ec2.internal   Ready     59s"], "warnings": []}
ok: [ec2-52-207-222-117.compute-1.amazonaws.com] => (item=ip-172-18-10-183.ec2.internal) => {"changed": false, "cmd": ["oc", "get", "node", "ip-172-18-10-183.ec2.internal"], "delta": "0:00:00.136783", "end": "2016-08-17 23:34:51.713095", "item": "ip-172-18-10-183.ec2.internal", "rc": 0, "start": "2016-08-17 23:34:51.576312", "stderr": "", "stdout": "NAME                            STATUS    AGE\nip-172-18-10-183.ec2.internal   Ready     1m", "stdout_lines": ["NAME                            STATUS    AGE", "ip-172-18-10-183.ec2.internal   Ready     1m"], "warnings": []}
ok: [ec2-52-207-222-117.compute-1.amazonaws.com] => (item=ip-172-18-7-239.ec2.internal) => {"changed": false, "cmd": ["oc", "get", "node", "ip-172-18-7-239.ec2.internal"], "delta": "0:00:00.129385", "end": "2016-08-17 23:34:53.420047", "item": "ip-172-18-7-239.ec2.internal", "rc": 0, "start": "2016-08-17 23:34:53.290662", "stderr": "", "stdout": "NAME                           STATUS    AGE\nip-172-18-7-239.ec2.internal   Ready     1m", "stdout_lines": ["NAME                           STATUS    AGE", "ip-172-18-7-239.ec2.internal   Ready     1m"], "warnings": []}
ok: [ec2-52-207-222-117.compute-1.amazonaws.com] => (item=ip-172-18-7-238.ec2.internal) => {"changed": false, "cmd": ["oc", "get", "node", "ip-172-18-7-238.ec2.internal"], "delta": "0:00:00.130831", "end": "2016-08-17 23:34:55.121077", "item": "ip-172-18-7-238.ec2.internal", "rc": 0, "start": "2016-08-17 23:34:54.990246", "stderr": "", "stdout": "NAME                           STATUS    AGE\nip-172-18-7-238.ec2.internal   Ready     1m", "stdout_lines": ["NAME                           STATUS    AGE", "ip-172-18-7-238.ec2.internal   Ready     1m"], "warnings": []}
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (50 retries left).
<--snip: 48 similar lines, counting down from 49 to 2 retries left-->
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (1 retries left).
failed: [ec2-52-207-222-117.compute-1.amazonaws.com] (item=ip-172-18-6-178.ec2.internal) => {"changed": false, "cmd": ["oc", "get", "node", "ip-172-18-6-178.ec2.internal"], "delta": "0:00:00.127775", "end": "2016-08-17 23:40:31.439373", "failed": true, "item": "ip-172-18-6-178.ec2.internal", "rc": 1, "start": "2016-08-17 23:40:31.311598", "stderr": "Error from server: nodes \"ip-172-18-6-178.ec2.internal\" not found", "stdout": "", "stdout_lines": [], "warnings": []}
ok: [ec2-52-207-222-117.compute-1.amazonaws.com] => (item=ip-172-18-6-179.ec2.internal) => {"changed": false, "cmd": ["oc", "get", "node", "ip-172-18-6-179.ec2.internal"], "delta": "0:00:00.128504", "end": "2016-08-17 23:40:33.133694", "item": "ip-172-18-6-179.ec2.internal", "rc": 0, "start": "2016-08-17 23:40:33.005190", "stderr": "", "stdout": "NAME                           STATUS    AGE\nip-172-18-6-179.ec2.internal   Ready     6m", "stdout_lines": ["NAME                           STATUS    AGE", "ip-172-18-6-179.ec2.internal   Ready     6m"], "warnings": []}

NO MORE HOSTS LEFT *************************************************************
	to retry, use: --limit @/home/slave1/workspace/Launch-Environment-Flexy/private-openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
ec2-52-207-222-117.compute-1.amazonaws.com : ok=346  changed=101  unreachable=0    failed=1   
ec2-52-90-219-39.compute-1.amazonaws.com : ok=143  changed=43   unreachable=0    failed=0   
ec2-54-164-208-60.compute-1.amazonaws.com : ok=143  changed=43   unreachable=0    failed=0   
ec2-54-165-53-104.compute-1.amazonaws.com : ok=189  changed=61   unreachable=0    failed=0   
ec2-54-172-58-88.compute-1.amazonaws.com : ok=189  changed=61   unreachable=0    failed=0   
ec2-54-234-204-42.compute-1.amazonaws.com : ok=331  changed=117  unreachable=0    failed=0   
localhost                  : ok=14   changed=8    unreachable=0    failed=0   

Checking the node status after the failed installation:
# oc get nodes
NAME                            STATUS    AGE
ip-172-18-10-178.ec2.internal   Ready     53m
ip-172-18-10-183.ec2.internal   Ready     53m
ip-172-18-6-179.ec2.internal    Ready     53m
ip-172-18-7-238.ec2.internal    Ready     53m
ip-172-18-7-239.ec2.internal    Ready     53m

Expected results:
Installation succeeds.

Additional info:

Comment 1 Jason DeTiberus 2016-08-19 11:22:09 UTC
It looks like we are checking the return code, which I would expect to be 0, but maybe we need to use a go template or json template to return the status for now?
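One way to check the reported status rather than only the exit code is sketched below in Python. It assumes the node is fetched as JSON (for example with `oc get node <name> -o json`) and that the object carries the standard Kubernetes `status.conditions` list. The function name `node_is_ready` and the sample document are illustrative only and are not part of openshift-ansible.

```python
import json


def node_is_ready(node_json: str) -> bool:
    """Return True only if the node reports a Ready condition with status "True".

    Assumes node_json is a Kubernetes Node object serialized as JSON,
    such as the output of `oc get node <name> -o json`.
    """
    node = json.loads(node_json)
    conditions = node.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    # No Ready condition reported yet: treat the node as not ready.
    return False


# Hypothetical, trimmed-down Node object used purely for illustration.
sample = json.dumps({
    "kind": "Node",
    "metadata": {"name": "ip-172-18-10-178.ec2.internal"},
    "status": {"conditions": [
        {"type": "OutOfDisk", "status": "False"},
        {"type": "Ready", "status": "True"},
    ]},
})

print(node_is_ready(sample))
```

A check like this distinguishes "registered but NotReady" from "registered and Ready", which the return code of `oc get node` alone cannot do.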

Comment 2 Jason DeTiberus 2016-08-19 13:28:05 UTC
Never mind my previous comment; I missed the actual failure when reading this on my phone earlier.

Comment 3 Jason DeTiberus 2016-08-19 13:32:41 UTC
Looking closer at the output, this appears to be a legitimate failure. The node that failed the check was 'ip-172-18-6-178.ec2.internal', which isn't listed in the 'oc get nodes' output.

Is there an error in the node logs for that host?
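The comparison behind this observation can be sketched in Python: take the node names the task iterated over (the `item` values in the task output) and subtract the names listed by `oc get nodes`. The helper names `parse_node_names` and `missing_nodes` are hypothetical; the data is copied from this report.

```python
def parse_node_names(oc_get_nodes_output: str) -> set:
    """Extract node names from the tabular output of `oc get nodes`."""
    lines = oc_get_nodes_output.strip().splitlines()
    # Skip the header row (NAME STATUS AGE); the name is the first column.
    return {line.split()[0] for line in lines[1:] if line.strip()}


def missing_nodes(expected, oc_get_nodes_output):
    """Return the expected node names absent from the output, sorted."""
    return sorted(set(expected) - parse_node_names(oc_get_nodes_output))


# Node names checked by "Wait for Node Registration" in this report.
expected = [
    "ip-172-18-10-178.ec2.internal",
    "ip-172-18-10-183.ec2.internal",
    "ip-172-18-7-239.ec2.internal",
    "ip-172-18-7-238.ec2.internal",
    "ip-172-18-6-178.ec2.internal",
    "ip-172-18-6-179.ec2.internal",
]

# `oc get nodes` output captured after the failed installation.
oc_output = """\
NAME                            STATUS    AGE
ip-172-18-10-178.ec2.internal   Ready     53m
ip-172-18-10-183.ec2.internal   Ready     53m
ip-172-18-6-179.ec2.internal    Ready     53m
ip-172-18-7-238.ec2.internal    Ready     53m
ip-172-18-7-239.ec2.internal    Ready     53m
"""

print(missing_nodes(expected, oc_output))
# prints ['ip-172-18-6-178.ec2.internal']
```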

Comment 4 Gan Huang 2016-08-22 01:55:26 UTC
Sorry, I didn't capture the logs, but I did check the node status after the failed installation.
# oc get nodes
NAME                            STATUS    AGE
ip-172-18-10-178.ec2.internal   Ready     53m
ip-172-18-10-183.ec2.internal   Ready     53m
ip-172-18-6-179.ec2.internal    Ready     53m
ip-172-18-7-238.ec2.internal    Ready     53m
ip-172-18-7-239.ec2.internal    Ready     53m

It looks like all listed nodes were Ready. I will capture more information for you if I hit the issue again.

Comment 5 Brenton Leanhardt 2016-09-07 19:40:51 UTC
Hi Huang Gan,

Has this issue happened recently? If not, I suggest we close this.

Comment 6 Gan Huang 2016-09-08 01:48:07 UTC
Hi Brenton,

I haven't hit the issue again recently, so I agree to close it for now.

Comment 7 Brenton Leanhardt 2016-09-08 11:57:32 UTC
Sounds good. Thanks for the help, Huang Gan.