Bug 1393000 - [3.3] Ansible upgrade from 3.2 to 3.3 fails
Summary: [3.3] Ansible upgrade from 3.2 to 3.3 fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.3.1
Assignee: Andrew Butcher
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-08 16:34 UTC by Brendan Mchugh
Modified: 2020-04-15 14:49 UTC
CC List: 11 users

Fixed In Version: openshift-ansible-3.3.62-1.git.0.b7473e7.el7
Doc Type: Bug Fix
Doc Text:
Previously, API verification during upgrades was performed from the Ansible control host, which may not have network access to each API server in some network topologies. Now, API server verification happens from the master hosts, avoiding these network access problems.
Clone Of:
Environment:
Last Closed: 2017-03-06 16:37:11 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHSA-2017:0448
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: ansible and openshift-ansible security and bug fix update
Last Updated: 2017-03-06 21:36:25 UTC

Description Brendan Mchugh 2016-11-08 16:34:21 UTC
Description of problem:
Ansible upgrade from 3.2 to 3.3 fails with "Timeout when waiting for NODE"


Version-Release number of selected component (if applicable):
openshift-ansible-playbooks-3.3.22-1.git.0.6c888c2.el7.noarch


How reproducible:
Always, but different nodes may fail.


Steps to Reproduce:
1. Install 3.2 
2. Ansible upgrade to 3.3


Actual results:
2016-10-11 14:23:40,286 p=15651 u=wnhadm |  PLAY [Restart masters] *********************************************************
2016-10-11 14:23:40,296 p=15651 u=wnhadm |  TASK [Restart master system] ***************************************************
2016-10-11 14:23:40,333 p=15651 u=wnhadm |  TASK [Wait for master API to come back online] *********************************
2016-10-11 14:23:40,370 p=15651 u=wnhadm |  TASK [Wait for master to start] ************************************************
2016-10-11 14:23:40,404 p=15651 u=wnhadm |  TASK [Wait for master to become available] *************************************
2016-10-11 14:23:40,438 p=15651 u=wnhadm |  TASK [fail] ********************************************************************
2016-10-11 14:23:40,473 p=15651 u=wnhadm |  TASK [Restart master] **********************************************************
2016-10-11 14:23:40,513 p=15651 u=wnhadm |  TASK [Restart master API] ******************************************************
2016-10-11 14:23:49,069 p=15651 u=wnhadm |  TASK [Wait for master API to come back online] *********************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_services.yml:11
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/utilities/logic/wait_for.py
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: wnhadm
<localhost> EXEC /bin/sh -c '/usr/bin/python2 && sleep 0'
fatal: [SIY05E97 -> localhost]: FAILED! => {
    "changed": false,
    "elapsed": 301,
    "failed": true,
    "invocation": {
        "module_args": {
            "connect_timeout": 5,
            "delay": 10,
            "exclude_hosts": null,
            "host": "SIY05E97",
            "path": null,
            "port": 8443,
            "search_regex": null,
            "state": "started",
            "timeout": 300
        },
        "module_name": "wait_for"
    },
    "msg": "Timeout when waiting for SIY05E97:8443"
}

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/home/wnhadm/.ansible-retry/upgrade.retry

PLAY RECAP *********************************************************************
SIY05E85                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E86                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E87                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E88                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E89                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E90                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E91                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E92                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E93                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E94                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E95                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E96                   : ok=86   changed=5    unreachable=0    failed=0
SIY05E97                   : ok=195  changed=14   unreachable=0    failed=1
SIY05E98                   : ok=189  changed=10   unreachable=0    failed=0
SIY05E99                   : ok=189  changed=10   unreachable=0    failed=0
localhost                  : ok=30   changed=17   unreachable=0    failed=0


Expected results:
Successful upgrade


Additional info:
The issue appears to be in /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_services.yml.
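Based on the task path and the "ESTABLISH LOCAL CONNECTION" line in the traceback, the failing task follows this general pattern (a sketch, not the literal file contents): wait_for runs as a local_action, so the Ansible control host itself must be able to reach each master's API port.

- name: Wait for master API to come back online
  become: no
  # local_action runs wait_for on the control host, which therefore
  # needs direct network access to every master's API port.
  local_action:
    module: wait_for
    host: "{{ inventory_hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"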

The customer found a workaround by commenting out the local_action lines as follows:

- name: Wait for master API to come back online
  become: no
#  local_action:
#    module: wait_for
  wait_for:
    host="{{ inventory_hostname }}"
    state=started
    delay=10
    port="{{ openshift.master.api_port }}"
  when: openshift_master_ha | bool and openshift.master.cluster_method != 'pacemaker'

With this change, the wait_for module executes on the remote host instead of the control host; a cleaned-up version of the task is sketched below. The same fix can be applied to:
playbooks/common/openshift-master/restart_hosts.yml
playbooks/common/openshift-master/restart_hosts_pacemaker.yml
playbooks/common/openshift-master/restart_services.yml
playbooks/common/openshift-master/restart_services_pacemaker.yml
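For reference, here is the workaround task in standard YAML module syntax (a minimal sketch; the parameters are copied from the task above):

- name: Wait for master API to come back online
  become: no
  # wait_for now executes on the target master itself, so only the
  # master needs to reach its own API port, not the control host.
  wait_for:
    host: "{{ inventory_hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"
  when: openshift_master_ha | bool and openshift.master.cluster_method != 'pacemaker'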

Comment 1 Scott Dodson 2016-11-08 19:21:02 UTC
Can you confirm that this is happening only when the Ansible host does not have access to the API endpoint? That seems like a really odd configuration; is that expected in this environment?

That said, I agree with the proposed fix. The chances of being able to reach the API endpoint from the remote host rather than the control host are probably higher.

Comment 5 Andrew Butcher 2017-01-05 21:32:40 UTC
Proposed fix: https://github.com/openshift/openshift-ansible/pull/3032
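Per the Doc Text above, the eventual fix verifies the API from the master hosts themselves rather than from the control host. A minimal sketch of that approach (illustrative only, not the literal PR diff; the fact names openshift.master.api_url and openshift.common.config_base are assumptions here):

- name: Verify API server
  # Executed on the master host itself, so the control host no longer
  # needs network access to the API port. Paths and retry counts are
  # illustrative.
  command: >
    curl --silent
    --cacert {{ openshift.common.config_base }}/master/ca-bundle.crt
    {{ openshift.master.api_url }}/healthz/ready
  register: api_output
  until: api_output.stdout == 'ok'
  retries: 120
  delay: 1
  changed_when: false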

Comment 14 liujia 2017-02-16 11:37:08 UTC
Version:
atomic-openshift-utils-3.3.64-1.git.0.43bfb06.el7.noarch

Steps:
1. RPM install OCP 3.2
2. Upgrade 3.2 to 3.3

Result:
Upgrade succeeded.

Comment 15 Anping Li 2017-02-16 12:13:24 UTC
Upgrade of a containerized environment passes too.

Comment 17 errata-xmlrpc 2017-03-06 16:37:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:0448

