Bug 1393000 - [3.3] Ansible upgrade from 3.2 to 3.3 fails
Summary: [3.3] Ansible upgrade from 3.2 to 3.3 fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.3.1
Assignee: Andrew Butcher
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-08 16:34 UTC by Brendan Mchugh
Modified: 2020-04-15 14:49 UTC
CC List: 11 users

Fixed In Version: openshift-ansible-3.3.62-1.git.0.b7473e7.el7
Doc Type: Bug Fix
Doc Text:
Previously, API verification during upgrades was performed from the Ansible control host, which may not have network access to each API server in some network topologies. Now, API server verification happens from the master hosts, avoiding these network access problems.
Clone Of:
Environment:
Last Closed: 2017-03-06 16:37:11 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHSA-2017:0448
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: ansible and openshift-ansible security and bug fix update
Last Updated: 2017-03-06 21:36:25 UTC

Description Brendan Mchugh 2016-11-08 16:34:21 UTC
Description of problem:
Ansible upgrade from 3.2 to 3.3 fails with "Timeout when waiting for NODE"


Version-Release number of selected component (if applicable):
openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch


How reproducible:
Always, but different nodes may fail.


Steps to Reproduce:
1. Install 3.2 
2. Ansible upgrade to 3.3


Actual results:
2016-10-11 14:23:40,286 p=15651 u=wnhadm |  PLAY [Restart masters] *********************************************************
2016-10-11 14:23:40,296 p=15651 u=wnhadm |  TASK [Restart master system] ***************************************************
2016-10-11 14:23:40,333 p=15651 u=wnhadm |  TASK [Wait for master API to come back online] *********************************
2016-10-11 14:23:40,370 p=15651 u=wnhadm |  TASK [Wait for master to start] ************************************************
2016-10-11 14:23:40,404 p=15651 u=wnhadm |  TASK [Wait for master to become available] *************************************
2016-10-11 14:23:40,438 p=15651 u=wnhadm |  TASK [fail] ********************************************************************
2016-10-11 14:23:40,473 p=15651 u=wnhadm |  TASK [Restart master] **********************************************************
2016-10-11 14:23:40,513 p=15651 u=wnhadm |  TASK [Restart master API] ******************************************************
2016-10-11 14:23:49,069 p=15651 u=wnhadm |  TASK [Wait for master API to come back online] *********************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_services.yml:11
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/utilities/logic/wait_for.py
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: wnhadm
<localhost> EXEC /bin/sh -c '/usr/bin/python2 && sleep 0'
fatal: [SIY05E97 -> localhost]: FAILED! => {
    "changed": false,
    "elapsed": 301,
    "failed": true,
    "invocation": {
        "module_args": {
            "connect_timeout": 5,
            "delay": 10,
            "exclude_hosts": null,
            "host": "SIY05E97",
            "path": null,
            "port": 8443,
            "search_regex": null,
            "state": "started",
            "timeout": 300
        },
        "module_name": "wait_for"
    },
    "msg": "Timeout when waiting for SIY05E97:8443"
}

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/home/wnhadm/.ansible-retry/upgrade.retry

PLAY RECAP *********************************************************************
SIY05E85                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E86                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E87                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E88                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E89                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E90                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E91                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E92                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E93                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E94                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E95                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E96                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E97                   : ok=195  changed=14   unreachable=0    failed=1
SIY05E98                   : ok=189  changed=10   unreachable=0    failed=0
SIY05E99                   : ok=189  changed=10   unreachable=0    failed=0
localhost                  : ok=30   changed=17   unreachable=0    failed=0


Expected results:
Successful upgrade


Additional info:
The issue appears to be in /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_services.yml.
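Based on the task path and the "ESTABLISH LOCAL CONNECTION" line in the traceback, the failing task follows this general pattern (a sketch, not the literal file contents): wait_for runs as a local_action, so the Ansible control host itself must be able to reach each master's API port.

- name: Wait for master API to come back online
  become: no
  # local_action runs wait_for on the control host, which therefore
  # needs direct network access to every master's API port.
  local_action:
    module: wait_for
    host: "{{ inventory_hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"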

The customer found a workaround by commenting out the local_action lines as follows:

- name: Wait for master API to come back online
  become: no
#  local_action:
#    module: wait_for
  wait_for:
    host="{{ inventory_hostname }}"
    state=started
    delay=10
    port="{{ openshift.master.api_port }}"
  when: openshift_master_ha | bool and openshift.master.cluster_method != 'pacemaker'

With this change, the wait_for module executes on the remote host instead of the control host; a cleaned-up version of the task is sketched below. The same fix can be applied to:
playbooks/common/openshift-master/restart_hosts.yml
playbooks/common/openshift-master/restart_hosts_pacemaker.yml
playbooks/common/openshift-master/restart_services.yml
playbooks/common/openshift-master/restart_services_pacemaker.yml
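For reference, here is the workaround task in standard YAML module syntax (a minimal sketch; the parameters are copied from the task above):

- name: Wait for master API to come back online
  become: no
  # wait_for now executes on the target master itself, so only the
  # master needs to reach its own API port, not the control host.
  wait_for:
    host: "{{ inventory_hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"
  when: openshift_master_ha | bool and openshift.master.cluster_method != 'pacemaker'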

Comment 1 Scott Dodson 2016-11-08 19:21:02 UTC
Can you confirm that this is happening only when the Ansible host does not have access to the API endpoint? That seems like a really odd configuration; is that expected in this environment?

That said, I agree with the proposed fix. The chances of being able to reach the API endpoint from the remote host rather than the control host are probably higher.

Comment 5 Andrew Butcher 2017-01-05 21:32:40 UTC
Proposed fix: https://github.com/openshift/openshift-ansible/pull/3032
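Per the Doc Text above, the eventual fix verifies the API from the master hosts themselves rather than from the control host. A minimal sketch of that approach (illustrative only, not the literal PR diff; the fact names openshift.master.api_url and openshift.common.config_base are assumptions here):

- name: Verify API server
  # Executed on the master host itself, so the control host no longer
  # needs network access to the API port. Paths and retry counts are
  # illustrative.
  command: >
    curl --silent
    --cacert {{ openshift.common.config_base }}/master/ca-bundle.crt
    {{ openshift.master.api_url }}/healthz/ready
  register: api_output
  until: api_output.stdout == 'ok'
  retries: 120
  delay: 1
  changed_when: false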

Comment 14 liujia 2017-02-16 11:37:08 UTC
Version:
atomic-openshift-utils-3.3.64-1.git.0.43bfb06.el7.noarch

Steps:
1. RPM install OCP 3.2
2. Upgrade 3.2 to 3.3

Result:
Upgrade succeeded.

Comment 15 Anping Li 2017-02-16 12:13:24 UTC
Upgrade of a containerized environment passes too.

Comment 17 errata-xmlrpc 2017-03-06 16:37:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:0448

