Description of problem:
Couldn't install OCP 3.6 with atomic-openshift-installer; it aborts after 'Gathering information from hosts.'. /tmp/installer.txt shows that atomic-openshift-installer tries to read the file /root/.config/openshift/.ansible/callback_facts.yaml, but I found the file is blank. It seems openshift_facts.yml did not gather facts correctly.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.6.52-1.git.0.ba5c659.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. atomic-openshift-installer -d install

Actual results:
*** Installation Summary ***

Hosts:
- openshift-142.lab.sjc.redhat.com
  - OpenShift master
  - OpenShift node
  - Etcd
- openshift-103.lab.sjc.redhat.com
  - OpenShift master
  - OpenShift node
  - Etcd
- openshift-135.lab.sjc.redhat.com
  - OpenShift master
  - OpenShift node
  - Etcd
- openshift-148.lab.sjc.redhat.com
  - Load Balancer (HAProxy)
- openshift-148.lab.sjc.redhat.com
  - Storage

Total OpenShift masters: 3
Total OpenShift nodes: 3

NOTE: Multiple masters specified, this will be an HA deployment with a separate etcd cluster. You will be prompted to provide the FQDN of a load balancer and a host for storage once finished entering hosts.

WARNING: Dedicated nodes are recommended for an HA deployment. If no dedicated nodes are specified, each configured master will be marked as a schedulable node.

Gathering information from hosts...
There was a problem fetching the required information.
Please see /tmp/ansible.log for details.

Expected results:
Installation proceeds past the facts-gathering step.

Additional info:
Created attachment 1276203 [details] Installer logs
I think I've run into this problem before. Can you try erasing the `/root/.config/openshift/.ansible/` directory and re-running the test?

The error you pasted in the OP may seem confusing at first; let's break it down together:

> Gathering information from hosts...
> There was a problem fetching the required information.
> Please see /tmp/ansible.log for details.

The 'Gathering information' step is this log message (installer.txt):

> 2017-05-04 03:54:56,298 - installer - DEBUG - Going to subprocess
> out to ansible now with these args:
> ansible-playbook --inventory-file=/root/.config/openshift/hosts
> /usr/share/ansible/openshift-ansible//playbooks/byo/openshift_facts.yml

The 'problem fetching the required information' step is what happened after this log record:

> 2017-05-04 03:57:31,169 - installer - DEBUG - Going to try
> to read this file: /root/.config/openshift/.ansible/callback_facts.yaml

Unfortunately, there's not much more detail in the logs other than that ^^^ one line at the end of `/tmp/installer.txt`.

You said "but I found the file is blank." This makes sense when I step through the code and the log file side by side.

> 2017-05-04 03:54:05,350 - installer - DEBUG - 'ansible_inventory_directory' does not exist, creating it now (/root/.config/openshift/.ansible)

indicates that the path defined as `ansible_inventory_directory` in the code does not exist. It is created in the next step.

> No 'ansible_callback_facts_yaml' in self.settings

indicates that the path to the `callback_facts.yaml` file wasn't set already, so the value is populated at this time.

Now the directory exists (.config/openshift/.ansible) and the path to the callback facts file is set. The file will be empty at this point. This matches what you reported.

Next we get to the `load_system_facts` function: "Retrieves system facts from the remote systems."
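The two setup steps above (create the missing directory, then populate the facts-file path in settings) can be sketched roughly as below. This is a paraphrase, not the installer's actual code; the helper name `prepare_callback_facts` is mine:

```python
import os
import tempfile

def prepare_callback_facts(settings):
    """Sketch mirroring the two log records quoted above (hypothetical
    helper name; the real logic lives inside the installer's config code)."""
    inv_dir = settings['ansible_inventory_directory']
    if not os.path.exists(inv_dir):
        # "'ansible_inventory_directory' does not exist, creating it now"
        os.makedirs(inv_dir)
    if 'ansible_callback_facts_yaml' not in settings:
        # "No 'ansible_callback_facts_yaml' in self.settings"
        settings['ansible_callback_facts_yaml'] = os.path.join(
            inv_dir, 'callback_facts.yaml')
    return settings

# Demo against a throwaway directory instead of /root/.config/openshift:
demo = prepare_callback_facts(
    {'ansible_inventory_directory': os.path.join(tempfile.mkdtemp(), '.ansible')})
```

Note that only the *path* is recorded here; nothing writes `callback_facts.yaml` yet, which is why the file does not exist (or is empty) at this stage.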
If ansible fails to run (that's the "going to subprocess out to ansible now" log record) it will print out a message. But before that even happens, the installer logs all of the environment variables that will be set when the subprocess runs:

> debug_env(env_vars)

We see these debug messages in your installer.txt file. Next we subprocess out to ansible. If it had failed, we would have seen this happen:

> installer_log.debug("Exit status from subprocess was not 0")

but we did not see that message either. That makes sense: your ansible.log file has no 'failed' items marked in it. So code flow continues...

Next we try to read in the collected facts. (An environment variable sets the location of the callback facts yaml file; when ansible runs, it writes the collected facts to that location, /root/.config/openshift/.ansible/callback_facts.yaml.) This message from installer.txt is what happens at this point:

> 2017-05-04 03:57:31,169 - installer - DEBUG - Going to try to read this file: /root/.config/openshift/.ansible/callback_facts.yaml

The code says that when this debug record is emitted, a yaml load is run:

> callback_facts = yaml.safe_load(callback_facts_file)

If there were a YAML parsing error at this time, we would have seen an error message like this:

> print("Error in {}".format(CFG.settings['ansible_callback_facts_yaml']), exc)
> print("Try deleting and rerunning the atomic-openshift-installer")

but you did not get that message. That means code flow continued.

All of the above processing happened in the `install` function in `ooinstall/cli_installer.py` by means of calling the `openshift_ansible.default_facts` function. Note that

> click.echo('Gathering information from hosts...')

happened before calling `default_facts()` (which produced all of the above log records):

> # cli_installer.py:
> def install(...)
>     ...
>     callback_facts, error = openshift_ansible.default_facts(...)
>     ...
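The subtlety in the read step above is worth demonstrating: PyYAML's `safe_load()` on an empty file returns `None` rather than raising, so a blank callback_facts.yaml sails straight past the parse-error branch. A minimal sketch (the helper name `read_callback_facts` is mine, and I'm using a temp file as a stand-in for the real path):

```python
import tempfile
import yaml  # PyYAML, the same library the installer uses for this load

def read_callback_facts(path):
    """Sketch of the load that follows the 'Going to try to read this
    file' log record: yaml.safe_load over the callback facts file."""
    with open(path) as callback_facts_file:
        return yaml.safe_load(callback_facts_file)

# Create an EMPTY stand-in for callback_facts.yaml:
with tempfile.NamedTemporaryFile('w', suffix='.yaml', delete=False) as f:
    empty_path = f.name

result = read_callback_facts(empty_path)  # None: no exception, no error printed
```

So an empty facts file produces `callback_facts = None` with exit status 0, which is exactly the combination the next step checks for.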
`default_facts` returns the results from calling `load_system_facts()`.

`load_system_facts` returns a tuple: `callback_facts, 0`, where `callback_facts` is the dictionary loaded from `ansible_callback_facts_yaml` and `0` is, well, just `0`.

Next, in the `install` function, we have the result checking:

> if error or callback_facts is None:

We know that `error` is 0 here because `load_system_facts` would have returned the tuple `[], 1` if the subprocess did not exit 0, and we saw no error messages indicating the subprocess was not 0. This means that the other part of the if condition must be evaluating true, i.e. `... or callback_facts is None`.

This matches what you described in your original report: the callback facts file is empty, and an empty file loads as None.

So now we know HOW we got to this point. What we need to figure out next is why that file was empty. That'll require some more research and testing.
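Putting the check above into a standalone sketch makes the reported path obvious: ansible exited 0 (so `error` is 0), but the empty facts file loaded as `None`, so the error message prints anyway. The wrapper function here is mine, for illustration only:

```python
def facts_ok(callback_facts, error):
    """Sketch of the result check in install(): the same condition
    quoted above, wrapped so each branch can be exercised directly."""
    if error or callback_facts is None:
        # -> "There was a problem fetching the required information."
        return False
    return True

facts_ok({'host1': {}}, 0)  # True:  normal path, facts were collected
facts_ok([], 1)             # False: subprocess did not exit 0
facts_ok(None, 0)           # False: the path reported in this bug --
                            #        clean exit but an empty facts file
```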
> I think I've run into this problem before. Can you try erasing the
> `/root/.config/openshift/.ansible/` directory and re-running the test?

Ignore that comment. After running through all that code in the previous reply, I realize you had an empty .ansible directory to begin with.
Tim, I don't know if this is related, but it's something I came across while looking into this issue earlier today. We recently refactored the groups being used, and openshift_facts.yml was one of the files modified. With the way we've changed the groups, it could have affected how vars are being passed, because we are not using OSEv3. I don't want to send you on a wild goose chase, but this may be related.

https://github.com/openshift/openshift-ansible/pull/4045/files#diff-d891322dfc8f51659b38fab6c096cd23R11

I looked at the commits just prior to the version in question, and that's how I filtered down to this PR/commit.
Proposed: https://github.com/openshift/openshift-ansible/pull/4118
Merged: https://github.com/openshift/openshift-ansible/pull/4118
Verified and passed on openshift-ansible-3.6.68.