Description of problem:
====
The OpenShift ansible installer sometimes gets the following errors and hosts become unreachable.
~~~
2016-09-24 04:09:30,135 p=18851 u=root | changed: [xx.xx.xx.xx]
2016-09-24 06:09:42,930 p=18851 u=root | fatal: [yy.yy.yy.yy]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}
2016-09-24 06:09:42,931 p=18851 u=root | fatal: [zz.zz.zz.zz]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}
... snip ...
2016-09-24 06:11:04,531 p=18851 u=root | yy.yy.yy.yy : ok=56 changed=8 unreachable=1 failed=0
2016-09-24 06:11:04,531 p=18851 u=root | zz.zz.zz.zz : ok=56 changed=8 unreachable=1 failed=0
~~~

Version-Release number of selected component (if applicable):
====
ansible-2.2.0-0.5.prerelease.el7.noarch
openshift-ansible-3.2.28-1.git.0.5a85fc5.el7.noarch

How reproducible:
====
Steps to Reproduce:
1. Run the ansible installer with multiple masters

Actual results:
====
The ansible installer got the above unreachable errors after iptables restarted.

Expected results:
====
The ansible installer does not get the errors.

Additional info:
====
Attached in private:
- Ansible inventory file
- sosreport on xx.xx.xx.xx and yy.yy.yy.yy hosts
- ansible log
NOTE: Although this issue can be solved with ansible_connection=local for the local master, as in https://bugzilla.redhat.com/show_bug.cgi?id=1312203, this ticket concerns the same errors occurring on the remote masters.
Need to document comment 5
*** Bug 1394966 has been marked as a duplicate of this bug. ***
Hi, the customer I attached to this case on 2016-10-27 is seeing this problem and needs a resolution as soon as we can work towards one. Are there any other ideas of things we can try?
After some discussion, I came up with a possible solution, seen here: https://github.com/openshift/openshift-ansible/pull/2956 . If we could have the customer test with this patch, it would be helpful.
For more information, the working theory is that firewalld is enabled on the hosts before installation and that disabling it is causing the ssh disconnect. If the above patch fails, having them manually disable firewalld before installation (and possibly enable iptables afterward) would confirm or dispel this theory.
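To spell out the manual workaround mentioned above, a minimal sketch, assuming systemd-managed RHEL 7 hosts; the exact commands are illustrative and not part of the proposed patch:

~~~
# On each host, before running the installer:
systemctl stop firewalld
systemctl disable firewalld

# Optionally switch to iptables afterward:
yum install -y iptables-services
systemctl enable iptables
systemctl start iptables
~~~

If the install then completes without the UNREACHABLE errors, that points at the firewalld teardown as the trigger.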
The customer tested adding a pause after disabling firewalld and it fixed their issue. This is from the customer:

--------Marriott--------
We tested this and it worked. All we did was copy your pause further down in the file and added it below the second task. Might be something good to incorporate into the base install.

~~~
[root@master01-devtest-vxby ~]# head iptables_hanging_fix.yml
---
- name: Check if firewalld is installed
  command: rpm -q firewalld
  args:
    # Disables the following warning:
    # Consider using yum, dnf or zypper module rather than running rpm
    warn: no
  register: pkg_check
  failed_when: pkg_check.rc > 1
  changed_when: no

- name: Ensure firewalld service is not enabled
  service:
    name: firewalld
    state: stopped
    enabled: no
  when: "{{ pkg_check.rc == 0 }}"

- name: Red Hat Support 01727898 Pause
  pause: seconds=10
  when: "{{ result | changed }}"
~~~
------------------------------------------
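As a side note on why the pause helps: stopping firewalld tears down the controller's established ssh connections, while new connections succeed once the firewall change has settled, so waiting a few seconds before the next task avoids the UNREACHABLE failure. A small Python sketch of that wait-until-reachable idea (my illustration only, not code from the installer or from Ansible):

```python
import socket
import time

def wait_for_port(host, port, timeout=10.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout expires.

    Returns True as soon as a connection is accepted, False if the
    deadline passes first. This mimics, in plain Python, what the added
    10-second pause gives the installer: time for new ssh connections
    to the host to start succeeding again.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. refused/timed out)
            # while the port is not yet reachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A fixed `pause` is simpler and is what was merged; a poll like this would return as soon as the host is reachable instead of always waiting the full interval.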
https://github.com/openshift/openshift-ansible/pull/3196
Verified with version atomic-openshift-utils-3.2.47-1.git.0.34a924d; the fix takes effect and installation succeeds.
~~~
[root@ansible ~]# ansible-playbook -i hosts -v /usr/share/ansible/openshift-ansible/playbooks/byo/config
...
TASK [os_firewall : Wait 10 seconds after disabling firewalld] *****************
Tuesday 07 February 2017 03:25:19 +0000 (0:00:02.785) 0:03:40.765 ******
Pausing for 10 seconds (ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
...
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:0448