Description of problem:
====
The OpenShift ansible installer sometimes gets the following errors and hosts become unreachable.
~~~
2016-09-24 04:09:30,135 p=18851 u=root | changed: [xx.xx.xx.xx]
2016-09-24 06:09:42,930 p=18851 u=root | fatal: [yy.yy.yy.yy]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}
2016-09-24 06:09:42,931 p=18851 u=root | fatal: [zz.zz.zz.zz]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}
... snip ...
2016-09-24 06:11:04,531 p=18851 u=root | yy.yy.yy.yy : ok=56 changed=8 unreachable=1 failed=0
2016-09-24 06:11:04,531 p=18851 u=root | zz.zz.zz.zz : ok=56 changed=8 unreachable=1 failed=0
~~~

Version-Release number of selected component (if applicable):
====
ansible-2.2.0-0.5.prerelease.el7.noarch
openshift-ansible-3.2.28-1.git.0.5a85fc5.el7.noarch

How reproducible:
====
Steps to Reproduce:
1. Run the ansible installer with multiple masters

Actual results:
====
The ansible installer got the above unreachable errors after iptables restarted.

Expected results:
====
The ansible installer does not get the errors.

Additional info:
====
Attached in private:
- Ansible inventory file
- sosreport on xx.xx.xx.xx and yy.yy.yy.yy hosts
- ansible log
NOTE: Although this issue can be solved with ansible_connection=local for the local master, as in https://bugzilla.redhat.com/show_bug.cgi?id=1312203, this ticket concerns the same errors occurring on the remote masters.
Need to document comment 5
*** Bug 1394966 has been marked as a duplicate of this bug. ***
Hi, the customer I attached to this case on 2016-10-27 is seeing this problem and needs a resolution as soon as we can work towards one. Are there any other ideas of things we can try?
After some discussion, I came up with a possible solution, seen here: https://github.com/openshift/openshift-ansible/pull/2956 . If we could have the customer test with this patch, it would be helpful.
For more information, the working theory is that firewalld is enabled on the hosts before installation and that disabling it is causing the ssh disconnect. If the above patch fails, having them manually disable firewalld before installation (and possibly enable iptables afterward) would confirm or dispel this theory.
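To spell out the manual workaround mentioned above, a minimal sketch, assuming systemd-managed RHEL 7 hosts; the exact commands are illustrative and not part of the proposed patch:

~~~
# On each host, before running the installer:
systemctl stop firewalld
systemctl disable firewalld

# Optionally switch to iptables afterward:
yum install -y iptables-services
systemctl enable iptables
systemctl start iptables
~~~

If the install then completes without the UNREACHABLE errors, that points at the firewalld teardown as the trigger.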
The customer tested adding a pause after disabling firewalld and it fixed their issue. This is from the customer:

--------Marriott--------
We tested this and it worked. All we did was copy your pause further down in the file and added it below the second task. Might be something good to incorporate into the base install.

~~~
[root@master01-devtest-vxby ~]# head iptables_hanging_fix.yml
---
- name: Check if firewalld is installed
  command: rpm -q firewalld
  args:
    # Disables the following warning:
    # Consider using yum, dnf or zypper module rather than running rpm
    warn: no
  register: pkg_check
  failed_when: pkg_check.rc > 1
  changed_when: no

- name: Ensure firewalld service is not enabled
  service:
    name: firewalld
    state: stopped
    enabled: no
  when: "{{ pkg_check.rc == 0 }}"

- name: Red Hat Support 01727898 Pause
  pause: seconds=10
  when: "{{ result | changed }}"
~~~
------------------------------------------
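As a side note on why the pause helps: stopping firewalld tears down the controller's established ssh connections, while new connections succeed once the firewall change has settled, so waiting a few seconds before the next task avoids the UNREACHABLE failure. A small Python sketch of that wait-until-reachable idea (my illustration only, not code from the installer or from Ansible):

```python
import socket
import time

def wait_for_port(host, port, timeout=10.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout expires.

    Returns True as soon as a connection is accepted, False if the
    deadline passes first. This mimics, in plain Python, what the added
    10-second pause gives the installer: time for new ssh connections
    to the host to start succeeding again.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. refused/timed out)
            # while the port is not yet reachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A fixed `pause` is simpler and is what was merged; a poll like this would return as soon as the host is reachable instead of always waiting the full interval.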
https://github.com/openshift/openshift-ansible/pull/3196
Verified with version atomic-openshift-utils-3.2.47-1.git.0.34a924d; the fix takes effect and installation succeeds.
~~~
[root@ansible ~]# ansible-playbook -i hosts -v /usr/share/ansible/openshift-ansible/playbooks/byo/config
...
TASK [os_firewall : Wait 10 seconds after disabling firewalld] *****************
Tuesday 07 February 2017 03:25:19 +0000 (0:00:02.785) 0:03:40.765 ******
Pausing for 10 seconds (ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
...
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:0448