Bug 1571381

Summary: Install fails on FAILED - RETRYING: Wait for API to become available
Product: OpenShift Container Platform Reporter: Steven Walter <stwalter>
Component: InstallerAssignee: Scott Dodson <sdodson>
Status: CLOSED NOTABUG QA Contact: Johnny Liu <jialiu>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.9.0CC: aos-bugs, jokerman, maupadhy, mmccomas, stwalter, wmeng
Target Milestone: ---   
Target Release: 3.9.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-05 14:06:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Steven Walter 2018-04-24 16:10:00 UTC
Description of problem:
TASK [openshift_master : Wait for API to become available]

*FAILED - RETRYING: Wait for API to become available (xx retries left).*

However, re-running the playbook allows it to continue.

Version-Release number of the following components:
openshift-ansible-3.7.42-1.git.2.9ee4e71.el7.noarch
Tried with:

ansible 2.5.0-2, 2.4.1.0, and 2.3.1.0


How reproducible:
Unconfirmed


Actual results:


TASK [openshift_master : Wait for API to become available]
************************************************************
************************************************************
*********************

*FAILED - RETRYING: Wait for API to become available (120 retries left).*
. . .
*FAILED - RETRYING: Wait for API to become available (1 retries left).*

* [WARNING]: Consider using get_url or uri module rather than running curl*

fatal: [ec2-54-242-126-28.compute-1.amazonaws.com]: FAILED! => {
    "attempts": 120,
    "changed": false,
    "cmd": [
        "curl",
        "--silent",
        "--tlsv1.2",
        "--cacert",
        "/etc/origin/master/ca-bundle.crt",
        "https://console.sandbox.example.com/healthz/ready"
    ],
    "delta": "0:00:00.104755",
    "end": "2018-03-29 22:11:12.608324",
    "rc": 35,
    "start": "2018-03-29 22:11:12.503569"
}
MSG:
non-zero return code



Expected results:
Install completes

Additional info:
I know this is a somewhat common error, however the common explanations I've found for the issue do not apply. And, it is suspicious given that the second runthrough passes this step normally and install completes.

I will upload logs etc

Comment 3 Scott Dodson 2018-04-24 20:02:46 UTC
How long does the load balancer take to insert hosts into the pool? I assume

Comment 4 Steven Walter 2018-04-25 21:46:48 UTC
I don't understand the question -- was "I assume" cut off?

Comment 5 Scott Dodson 2018-04-26 12:51:13 UTC
Yes, sorry about that.

I'm assuming that clusterHostname = 'console.sandbox.example.com' ? And that it's some sort of load balancer. Can you verify that the load balancer is routing traffic as expected to the hosts at the time of the failure? The fact that simply re-running things fixes it leads me to believe that there's a delay in the load balancer routing traffic to the API servers.

Comment 6 Madhusudan Upadhyay 2018-05-14 10:24:14 UTC
Hi,
A little input from the customers end:

I tried to hit the loadbalancer host name. I'm getting nothing back at the current stage. 

[ec2-user@ip-xx-xx-10-xxx ~]$ curl -k http://console.x.sandbox.x.com/
curl: (7) Failed connect to console.x.sandbox.x.com:80; Connection refused


Please let me know if more information is required for the same.

Comment 7 Scott Dodson 2018-05-14 12:35:56 UTC
(In reply to Madhusudan Upadhyay from comment #6)
> Hi,
> A little input from the customers end:
> 
> I tried to hit the loadbalancer host name. I'm getting nothing back at the
> current stage. 
> 
> [ec2-user@ip-xx-xx-10-xxx ~]$ curl -k http://console.x.sandbox.x.com/
> curl: (7) Failed connect to console.x.sandbox.x.com:80; Connection refused
> 
> 
> Please let me know if more information is required for the same.

Then this sounds as if we need to debug why their load balancer isn't routing traffic to the api servers. Can you please work on that with the customer? BTW, that curl is against http not https, where as the ansible scripts are checking https so that may be the reason. Anyway, we need to understand where this is failing. Can they curl the api servers directly?