Bug 1455836 - Upgrades fail due to slow reboots causing timeouts
Summary: Upgrades fail due to slow reboots causing timeouts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Jan Chaloupka
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-26 09:20 UTC by Marko Myllynen
Modified: 2017-11-28 21:56 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Upgrades that use system reboots to restart services could fail when hosts took longer than 5 minutes to restart. The timeout has been increased to 10 minutes. If a host takes longer than 10 minutes to restart, it is likely a problem that the administrator needs to investigate.
Clone Of:
Environment:
Last Closed: 2017-11-28 21:56:17 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Marko Myllynen 2017-05-26 09:20:01 UTC
Description of problem:
Due to known slowness when rebooting nodes (see https://bugzilla.redhat.com/show_bug.cgi?id=1441994), the playbook-based upgrades (see https://docs.openshift.com/container-platform/latest/install_config/upgrading/automated_upgrades.html#running-the-upgrade-playbook-directly) occasionally fail: a node reboot may take something like 3-6 minutes, but the upgrade playbooks have a timeout of 300 s (5 min) for node reboots.

If bug 1441994 can't be fixed soon, then please increase the timeout in the playbook to allow more time for slow-booting nodes.

Version-Release number of selected component (if applicable):
3.4.1

Comment 1 Scott Dodson 2017-05-26 13:28:47 UTC
Please always include the exact version of the playbooks and the actual error message.

I ask this because '3.4.1' is not an openshift-ansible version but an OCP version, and we've shipped 6 different versions of the playbooks for the OCP 3.4 channels.

If you've installed the playbooks via RPM:
rpm -q openshift-ansible
or, if you're working from a git checkout:
git describe

I've not looked at all the 3.4 versions, but the current version would poll indefinitely, every 10 seconds, while waiting for a host to come back.

Comment 2 Marko Myllynen 2017-05-26 13:50:05 UTC
This has been seen with several versions coming from the OCP 3.4 channels; the one currently in use is:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch

Which has:

- name: Wait for master to restart
  local_action:
    module: wait_for
      host="{{ inventory_hostname }}"
      state=started
      delay=10
      timeout=300
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    timeout: 300
    port: "{{ openshift.master.api_port }}"

Thanks.

Comment 3 Scott Dodson 2017-05-26 14:38:25 UTC
Can you please verify your RPMs? I've installed that RPM locally and that code is not present, nor is it present in the GitHub tag.

https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.4.89-1/playbooks/common/openshift-master/restart_hosts.yml#L10-L16

Comment 4 Marko Myllynen 2017-05-26 17:44:08 UTC
Sorry for being unclear. Here is a verified example:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# rpm -V openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# cat /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
---
- name: Restart master system
  # https://github.com/ansible/ansible/issues/10616
  shell: sleep 2 && shutdown -r now "OpenShift Ansible master rolling restart"
  async: 1
  poll: 0
  ignore_errors: true
  become: yes

- name: Wait for master to restart
  local_action:
    module: wait_for
      host="{{ inventory_hostname }}"
      state=started
      delay=10
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"
# 

So the "timeout" parameter in my earlier paste was extraneous, left over from some local experiments (setting it to 600 here fixes the issue). But as seen above, also here the wait_for module's default timeout of 300 [1] is in effect.

1) https://docs.ansible.com/ansible/wait_for_module.html

Thanks.

Comment 5 Brenton Leanhardt 2017-06-27 17:47:33 UTC
I'd rather we spend time making the upgrade process pick up where it left off than fight a battle of tuning timeouts for every possible situation. If more customers report this, I'm not opposed to the "double the timeout" bandaid.

Comment 6 Marko Myllynen 2017-06-28 08:48:27 UTC
(In reply to Brenton Leanhardt from comment #5)
> I'd rather we spend time making the upgrade process pick up where it left
> off than fighting a battle of tuning timeouts for every possible situation.

Note that here we are talking about a known bug which currently prevents upgrades altogether unless the playbooks are manually edited. Fixing the known issue would of course be the best option, but there has been no movement on that front. Increasing the timeout would be a trivial bandaid to avoid this and allow upgrading. Improving the playbooks to continue where they left off even in the presence of bugs sounds like a welcome improvement, but also like a non-trivial effort.
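For illustration, the manual edit mentioned above can be done along these lines. This is a hypothetical sketch: it operates on a temporary stand-in file containing the task from comment 2, not on the shipped /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml, whose exact contents vary between openshift-ansible versions.

```shell
# Stand-in copy of the "Wait for master to restart" task (from comment 2);
# in practice you would edit the shipped restart_hosts.yml after backing it up.
PLAYBOOK="$(mktemp)"
cat > "$PLAYBOOK" <<'EOF'
- name: Wait for master to restart
  local_action:
    module: wait_for
      host={{ inventory_hostname }}
      state=started
      delay=10
      timeout=300
  become: no
EOF
# Double the 300 s timeout so slow-booting hosts (3-6 min reboots) fit inside it:
sed -i 's/timeout=300/timeout=600/' "$PLAYBOOK"
grep 'timeout=' "$PLAYBOOK"   # shows the edited value
```

Versions that omit the timeout parameter entirely (as in comment 4) fall back to the wait_for default of 300 s, so there an explicit timeout=600 line would need to be added instead.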

Comment 9 Jan Chaloupka 2017-06-28 13:48:43 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4624

Comment 10 Scott Dodson 2017-08-24 19:45:35 UTC
Timeout has been increased in master.
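For reference, the merged change is of this shape (a sketch reconstructed from the snippets quoted earlier in this bug, not the literal diff from the upstream PR):

```yaml
- name: Wait for master to restart
  local_action:
    module: wait_for
      host={{ inventory_hostname }}
      state=started
      delay=10
      timeout=600   # raised from the previous 300 s (the wait_for default)
  become: no
```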

Comment 11 Anping Li 2017-08-30 02:37:55 UTC
The upgrade works well with the openshift-ansible master branch.

Comment 14 errata-xmlrpc 2017-11-28 21:56:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

