Bug 1447939 - [quick install] atomic-openshift-installer aborts after "Gathering information from hosts"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Russell Teague
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-04 09:29 UTC by Anping Li
Modified: 2017-08-16 19:51 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-15 12:35:04 UTC
Target Upstream Version:
Embargoed:


Attachments
Installer logs (10.99 KB, application/x-gzip)
2017-05-04 09:31 UTC, Anping Li


Links
Red Hat Product Errata RHEA-2017:1716 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.6 RPM Release Advisory, last updated 2017-08-10 09:02:50 UTC

Description Anping Li 2017-05-04 09:29:48 UTC
Description of problem:
Couldn't install OCP 3.6 with atomic-openshift-installer; it aborts after 'Gathering information from hosts.'. The /tmp/installer.txt log shows that atomic-openshift-installer tries to read the file /root/.config/openshift/.ansible/callback_facts.yaml, but I found that the file is blank.
It seems openshift_facts.yml did not gather facts correctly.



Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.6.52-1.git.0.ba5c659.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. atomic-openshift-installer -d install


Actual results:
*** Installation Summary ***

Hosts:
- openshift-142.lab.sjc.redhat.com
  - OpenShift master
  - OpenShift node
  - Etcd
- openshift-103.lab.sjc.redhat.com
  - OpenShift master
  - OpenShift node
  - Etcd
- openshift-135.lab.sjc.redhat.com
  - OpenShift master
  - OpenShift node
  - Etcd
- openshift-148.lab.sjc.redhat.com
  - Load Balancer (HAProxy)
- openshift-148.lab.sjc.redhat.com
  - Storage

Total OpenShift masters: 3
Total OpenShift nodes: 3

NOTE: Multiple masters specified, this will be an HA deployment with a separate
etcd cluster. You will be prompted to provide the FQDN of a load balancer and
a host for storage once finished entering hosts.
    

WARNING: Dedicated nodes are recommended for an HA deployment. If no dedicated
nodes are specified, each configured master will be marked as a schedulable
node.

Gathering information from hosts...
There was a problem fetching the required information. Please see /tmp/ansible.log for details.

Expected results:


Additional info:

Comment 1 Anping Li 2017-05-04 09:31:13 UTC
Created attachment 1276203 [details]
Installer logs

Comment 2 Tim Bielawa 2017-05-04 16:56:33 UTC
I think I've run into this problem before. Can you try erasing the `/root/.config/openshift/.ansible/` directory and re-running the test?


The error you pasted in the OP may seem confusing at first; let's break it down together:

> Gathering information from hosts...
> There was a problem fetching the required information.
> Please see /tmp/ansible.log for details.

The 'Gathering information' step is this log message (installer.txt):

> 2017-05-04 03:54:56,298 - installer - DEBUG - Going to subprocess
> out to ansible now with these args: 
> ansible-playbook --inventory-file=/root/.config/openshift/hosts
> /usr/share/ansible/openshift-ansible//playbooks/byo/openshift_facts.yml
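For reference, that 'Gathering information' step amounts to roughly the following (an illustrative sketch only, not the installer's actual code; the inventory and playbook paths are simply the ones from the log above):

    # Rough sketch of the 'Gathering information' step (illustration only,
    # not the real installer code; paths taken from the log above).
    import subprocess

    args = [
        "ansible-playbook",
        "--inventory-file=/root/.config/openshift/hosts",
        "/usr/share/ansible/openshift-ansible/playbooks/byo/openshift_facts.yml",
    ]
    exit_status = subprocess.call(args)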

The 'problem fetching the required information' step is what happened after this log record:

> 2017-05-04 03:57:31,169 - installer - DEBUG - Going to try
> to read this file: /root/.config/openshift/.ansible/callback_facts.yaml

Unfortunately, there's not much more detail in the logs other than that ^^^ one line at the end of `/tmp/installer.txt`.

You said "but I found the file is blank." This makes sense when I step through the code and the log file side-by-side.

> 2017-05-04 03:54:05,350 - installer - DEBUG - 'ansible_inventory_directory' does not exist, creating it now (/root/.config/openshift/.ansible)

Indicates that the path defined as `ansible_inventory_directory` in the code does not exist. It is created in the next step.

> No 'ansible_callback_facts_yaml' in self.settings

Indicates that the path to the `callback_facts.yaml` file wasn't set already. So the value is populated at this time.

Now the directory exists (.config/openshift/.ansible) and the path to the callback facts file is set. The file itself will be empty at this point, which matches what you reported.
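In other words, by this point the installer has done roughly the equivalent of the following (an illustrative sketch that mirrors the two log messages quoted above; the key names come from the log, the rest is paraphrase):

    # Illustration only: mirrors the two log messages quoted above.
    import os

    settings = {}  # stand-in for the installer's CFG.settings

    ansible_inventory_directory = "/root/.config/openshift/.ansible"
    if not os.path.exists(ansible_inventory_directory):
        # "'ansible_inventory_directory' does not exist, creating it now"
        os.makedirs(ansible_inventory_directory)

    if "ansible_callback_facts_yaml" not in settings:
        # "No 'ansible_callback_facts_yaml' in self.settings"
        settings["ansible_callback_facts_yaml"] = os.path.join(
            ansible_inventory_directory, "callback_facts.yaml")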

Next we get to the `load_system_facts` function, whose docstring reads: "Retrieves system facts from the remote systems."

If ansible fails to run (that's the "going to subprocess out to ansible now" log record), the installer will print out an error message.

But before that even happens, the installer logs all of the environment variables that will be set when the subprocess runs:

> debug_env(env_vars)

We see these debug messages in your installer.txt file.

Next we subprocess out to ansible. If it had failed, we would have seen this:

> installer_log.debug("Exit status from subprocess was not 0")

but we did not see that message either. That makes sense: your ansible.log file has no 'failed' items marked in it. So code flow continues...

Next we try to read in the collected facts. (An environment variable sets the location of the callback facts YAML file; when ansible runs, it writes the collected facts to that location, /root/.config/openshift/.ansible/callback_facts.yaml.)

This message from the installer.txt is what happens at this point:

> 2017-05-04 03:57:31,169 - installer - DEBUG - Going to try to read this file: /root/.config/openshift/.ansible/callback_facts.yaml

The code shows that when this debug record is emitted, a YAML load is run:

> callback_facts = yaml.safe_load(callback_facts_file)

If there were a YAML parsing error at this time, we would have seen an error message like this:

> print("Error in {}".format(CFG.settings['ansible_callback_facts_yaml']), exc)
> print("Try deleting and rerunning the atomic-openshift-installer")

but you did not get that message. That means that code flow continued. 
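Putting those pieces together, `load_system_facts` behaves roughly like this (a paraphrase of the flow described above, not the actual ooinstall source; the argument names here are stand-ins):

    # Paraphrase of the described flow, not the real function.
    import subprocess
    import yaml

    def load_system_facts(inventory_file, os_facts_path, env_vars, callback_facts_yaml):
        """Retrieve system facts from the remote systems (paraphrased)."""
        args = ["ansible-playbook",
                "--inventory-file={}".format(inventory_file),
                os_facts_path]
        if subprocess.call(args, env=env_vars) != 0:
            # "Exit status from subprocess was not 0"
            return [], 1
        # The ansible callback plugin wrote the gathered facts to this file.
        # (callback_facts_yaml stands in for CFG.settings['ansible_callback_facts_yaml'].)
        try:
            with open(callback_facts_yaml, "r") as callback_facts_file:
                callback_facts = yaml.safe_load(callback_facts_file)
        except yaml.YAMLError as exc:
            # The "Error in ..." / "Try deleting and rerunning ..." messages
            # quoted above come from this branch (error handling after the
            # prints is simplified here).
            print("Error in {}".format(callback_facts_yaml), exc)
            print("Try deleting and rerunning the atomic-openshift-installer")
            return [], 1
        return callback_facts, 0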

All of the above processing happened in the `install` function in `ooinstall/cli_installer.py` by means of calling the `openshift_ansible.default_facts` function.

Note that

> click.echo('Gathering information from hosts...')

happened before calling `default_facts()`, which produced all of the above log records.


> # cli_installer.py:
> def install(...)
>     ...
>     callback_facts, error = openshift_ansible.default_facts(...)
>     ...

`default_facts` returns the results from calling `load_system_facts()`

`load_system_facts` returns a tuple, `callback_facts, 0`, where `callback_facts` is the dictionary loaded from `ansible_callback_facts_yaml` and `0` is, well, just `0`.


Next in the `install` function we have the result checking:

> if error or callback_facts is None:

We know that `error` is 0 here because the `load_system_facts` function would have returned the tuple `[], 1` if the subprocess did not exit 0, and we saw no error messages indicating that the subprocess exited non-zero.

This means that the other part of the if condition must be evaluating to true, i.e. '...or callback_facts is None'.

This matches what you described in your original report: the callback facts file is empty, so the YAML load returns None.
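That is simply how PyYAML behaves: loading an empty document yields None. A quick standalone demonstration (not installer code):

    # Standalone demonstration: loading an empty YAML document yields None.
    import yaml

    print(yaml.safe_load(""))          # -> None
    print(yaml.safe_load("") is None)  # -> True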

So now we know HOW we got to this point. What we need to figure out next is why that file was empty. That'll require some more research and testing.

Comment 3 Tim Bielawa 2017-05-04 17:14:38 UTC
> I think I've ran into this problem before. Can you try erasing the
> `/root/.config/openshift/.ansible/` directory and re-running the test?

Ignore that comment. After running through all that code in the previous reply I realize you had an empty .ansible directory to begin with.

Comment 4 Russell Teague 2017-05-04 19:26:46 UTC
Tim, I don't know if this is related, but it's something I came across while looking into this issue earlier today.  We recently refactored the groups being used, and openshift_facts.yml was one of the files modified.  With the way we've changed the groups, it could have affected how vars are being passed, because we are no longer using OSEv3.  I don't want to send you on a wild goose chase, but this may be related.

https://github.com/openshift/openshift-ansible/pull/4045/files#diff-d891322dfc8f51659b38fab6c096cd23R11

I looked at the commits just prior to the version in question and that's how I filtered down to this PR/commit.

Comment 5 Russell Teague 2017-05-08 17:27:47 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/4118

Comment 6 Russell Teague 2017-05-11 12:10:57 UTC
Merged: https://github.com/openshift/openshift-ansible/pull/4118

Comment 8 Anping Li 2017-05-15 02:01:50 UTC
Verified and passed on openshift-ansible-3.6.68.

