Bug 1455836 - Upgrades fail due to slow reboots causing timeouts
Summary: Upgrades fail due to slow reboots causing timeouts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Jan Chaloupka
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-26 09:20 UTC by Marko Myllynen
Modified: 2017-11-28 21:56 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Upgrades that use system reboots to restart services could fail when hosts took longer than 5 minutes to restart. The timeout has been increased to 10 minutes. If a host takes longer than 10 minutes to restart, it is likely a problem that the administrator needs to investigate.
Clone Of:
Environment:
Last Closed: 2017-11-28 21:56:17 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Marko Myllynen 2017-05-26 09:20:01 UTC
Description of problem:
Due to known slowness when rebooting nodes (see https://bugzilla.redhat.com/show_bug.cgi?id=1441994), the playbook-based upgrades (see https://docs.openshift.com/container-platform/latest/install_config/upgrading/automated_upgrades.html#running-the-upgrade-playbook-directly) occasionally fail: a node reboot may take something like 3-6 minutes, but the upgrade playbooks have a timeout of 300 s (5 min) for node reboots.

If bug 1441994 can't be fixed soon, then please increase the timeout in the playbook to allow more time for slow-booting nodes.

Version-Release number of selected component (if applicable):
3.4.1

Comment 1 Scott Dodson 2017-05-26 13:28:47 UTC
Please always include the exact version of the playbooks and the actual error message.

I ask this because '3.4.1' is not an openshift-ansible version but an OCP version, and we've shipped 6 different versions of the playbooks for the OCP 3.4 channels.

If you've installed the playbooks via RPM:
rpm -q openshift-ansible
or, if you're working from a git checkout:
git describe

I've not looked at all the 3.4 versions, but the current version would poll indefinitely, every 10 seconds, while waiting for a host to come back.

Comment 2 Marko Myllynen 2017-05-26 13:50:05 UTC
This has been seen with several versions coming from the OCP 3.4 channels; the one currently in use is:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch

Which has:

- name: Wait for master to restart
  local_action:
    module: wait_for
      host="{{ inventory_hostname }}"
      state=started
      delay=10
      timeout=300
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    timeout: 300
    port: "{{ openshift.master.api_port }}"

Thanks.

Comment 3 Scott Dodson 2017-05-26 14:38:25 UTC
Can you please verify your RPMs? I've installed that RPM locally and that code is not present, nor is it present in the GitHub tag.

https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.4.89-1/playbooks/common/openshift-master/restart_hosts.yml#L10-L16

Comment 4 Marko Myllynen 2017-05-26 17:44:08 UTC
Sorry for being unclear. Here is a verified example:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# rpm -V openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# cat /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
---
- name: Restart master system
  # https://github.com/ansible/ansible/issues/10616
  shell: sleep 2 && shutdown -r now "OpenShift Ansible master rolling restart"
  async: 1
  poll: 0
  ignore_errors: true
  become: yes

- name: Wait for master to restart
  local_action:
    module: wait_for
      host="{{ inventory_hostname }}"
      state=started
      delay=10
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"
# 

So the "timeout" parameter in my earlier paste was extraneous, left over from some local experiments (setting it to 600 here fixes the issue). But as seen above, also here the wait_for module's default timeout of 300 [1] is in effect.

1) https://docs.ansible.com/ansible/wait_for_module.html

Thanks.

Comment 5 Brenton Leanhardt 2017-06-27 17:47:33 UTC
I'd rather we spend time making the upgrade process pick up where it left off than fight a battle of tuning timeouts for every possible situation. If more customers report this, I'm not opposed to the "double the timeout" bandaid.

Comment 6 Marko Myllynen 2017-06-28 08:48:27 UTC
(In reply to Brenton Leanhardt from comment #5)
> I'd rather we spend time making the upgrade process pick up where it left
> off than fighting a battle of tuning timeouts for every possible situation.

Note that here we are talking about a known bug which currently prevents upgrades altogether unless the playbooks are manually edited. Fixing the known issue would of course be the best option, but there has been no movement on that front. Increasing the timeout would be a trivial bandaid to avoid this and allow upgrading. Improving the playbooks to continue where they left off even in the presence of bugs sounds like a welcome improvement, but also like a non-trivial effort.
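For illustration, the manual edit mentioned above can be done along these lines. This is a hypothetical sketch: it operates on a temporary stand-in file containing the task from comment 2, not on the shipped /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml, whose exact contents vary between openshift-ansible versions.

```shell
# Stand-in copy of the "Wait for master to restart" task (from comment 2);
# in practice you would edit the shipped restart_hosts.yml after backing it up.
PLAYBOOK="$(mktemp)"
cat > "$PLAYBOOK" <<'EOF'
- name: Wait for master to restart
  local_action:
    module: wait_for
      host={{ inventory_hostname }}
      state=started
      delay=10
      timeout=300
  become: no
EOF
# Double the 300 s timeout so slow-booting hosts (3-6 min reboots) fit inside it:
sed -i 's/timeout=300/timeout=600/' "$PLAYBOOK"
grep 'timeout=' "$PLAYBOOK"   # shows the edited value
```

Versions that omit the timeout parameter entirely (as in comment 4) fall back to the wait_for default of 300 s, so there an explicit timeout=600 line would need to be added instead.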

Comment 9 Jan Chaloupka 2017-06-28 13:48:43 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4624

Comment 10 Scott Dodson 2017-08-24 19:45:35 UTC
Timeout has been increased in master.
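For reference, the merged change is of this shape (a sketch reconstructed from the snippets quoted earlier in this bug, not the literal diff from the upstream PR):

```yaml
- name: Wait for master to restart
  local_action:
    module: wait_for
      host={{ inventory_hostname }}
      state=started
      delay=10
      timeout=600   # raised from the previous 300 s (the wait_for default)
  become: no
```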

Comment 11 Anping Li 2017-08-30 02:37:55 UTC
The upgrade works well with the openshift-ansible master branch.

Comment 14 errata-xmlrpc 2017-11-28 21:56:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

