Bug 1455836

Summary:	Upgrades fail due to slow reboots causing timeouts
Product:	OpenShift Container Platform	Reporter:	Marko Myllynen <myllynen>
Component:	Cluster Version Operator	Assignee:	Jan Chaloupka <jchaloup>
Status:	CLOSED ERRATA	QA Contact:	Anping Li <anli>
Severity:	medium	Docs Contact:
Priority:	low
Version:	3.4.1	CC:	anli, aos-bugs, bleanhar, jokerman, jswensso, mmccomas, myllynen
Target Milestone:	---
Target Release:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Upgrades that made use of system reboots to restart services may have failed when hosts took longer than 5 minutes to restart. The timeout has been increased to 10 minutes. If a host takes longer than 10 minutes, it is likely a problem that the admin needs to investigate.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-11-28 21:56:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Marko Myllynen 2017-05-26 09:20:01 UTC

Description of problem:
Due to known slowness when rebooting nodes (see https://bugzilla.redhat.com/show_bug.cgi?id=1441994), the playbook-based upgrades (see https://docs.openshift.com/container-platform/latest/install_config/upgrading/automated_upgrades.html#running-the-upgrade-playbook-directly) occasionally fail: a node reboot may take 3-6 minutes, but the upgrade playbooks have a timeout of 300 s (5 min) for node reboots.

If bug 1441994 can't be fixed soon, please increase the timeout in the playbooks to allow more time for slow-booting nodes.

Version-Release number of selected component (if applicable):
3.4.1

Comment 1 Scott Dodson 2017-05-26 13:28:47 UTC

Please always include the exact version of the playbooks and the actual error message.

I ask this because '3.4.1' is not an openshift-ansible version but an OCP version and we've shipped 6 different versions of the playbooks for the OCP 3.4 channels.

If you've installed the playbooks via rpm:
  rpm -q openshift-ansible
or, if you're working from a git checkout:
  git describe

I've not looked at all 3.4 versions but the current version would poll indefinitely every 10 seconds while waiting for a host to come back.

Comment 2 Marko Myllynen 2017-05-26 13:50:05 UTC

This has been seen with several versions coming from OCP 3.4 channels, the one currently in use being:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch

Which has:

- name: Wait for master to restart
  local_action:
    module: wait_for
      host="{{ inventory_hostname }}"
      state=started
      delay=10
      timeout=300
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    timeout: 300
    port: "{{ openshift.master.api_port }}"

Thanks.

Comment 3 Scott Dodson 2017-05-26 14:38:25 UTC

Can you please verify your RPMs? I've installed that rpm locally and that code is not present, nor is it present in the github tag.

https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.4.89-1/playbooks/common/openshift-master/restart_hosts.yml#L10-L16

Comment 4 Marko Myllynen 2017-05-26 17:44:08 UTC

Sorry for being unclear. Here is a verified example:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# rpm -V openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# cat /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
---
- name: Restart master system
  # https://github.com/ansible/ansible/issues/10616
  shell: sleep 2 && shutdown -r now "OpenShift Ansible master rolling restart"
  async: 1
  poll: 0
  ignore_errors: true
  become: yes

- name: Wait for master to restart
  local_action:
    module: wait_for
      host="{{ inventory_hostname }}"
      state=started
      delay=10
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"
# 

So the "timeout" parameter in my previous comment was left over from local experiments (setting it to 600 here fixes the issue). But as seen above, the default timeout of 300 [1] is in use here as well.

[1] https://docs.ansible.com/ansible/wait_for_module.html
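
For reference, the local workaround mentioned above could look roughly like the following. This is only a sketch of the two wait tasks from restart_hosts.yml with an explicit timeout added; 600 is the value that worked in local testing here and is an assumption, not necessarily what an upstream fix would choose.

```yaml
# Sketch only: the wait tasks shown above with an explicit timeout.
# 600 (10 minutes) is the locally tested value, not a confirmed upstream setting.
- name: Wait for master to restart
  local_action:
    module: wait_for
      host="{{ inventory_hostname }}"
      state=started
      delay=10
      timeout=600
  become: no

# Same change for the API check that runs on the remote system.
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    timeout: 600
    port: "{{ openshift.master.api_port }}"
```

Without the explicit parameter, both tasks fall back to the module default of 300 seconds, which is too short for nodes that take up to 6 minutes to reboot.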

Thanks.

Comment 5 Brenton Leanhardt 2017-06-27 17:47:33 UTC

I'd rather we spend time making the upgrade process pick up where it left off than fighting a battle of tuning timeouts for every possible situation.  If more customers report this I'm not opposed to the "double the timeout" bandaid.

Comment 6 Marko Myllynen 2017-06-28 08:48:27 UTC

(In reply to Brenton Leanhardt from comment #5)
> I'd rather we spend time making the upgrade process pick up where it left
> off than fighting a battle of tuning timeouts for every possible situation.

Note that here we are talking about a known bug which currently prevents upgrades altogether unless the playbooks are manually edited. Fixing the known issue would of course be the best option, but there has been no movement on that front. Increasing the timeout would be a trivial bandaid to avoid this and allow upgrading. Improving the playbooks to continue where they left off even in case of bugs would be a welcome improvement, but it sounds like a non-trivial effort.

Comment 9 Jan Chaloupka 2017-06-28 13:48:43 UTC

Upstream PR: https://github.com/openshift/openshift-ansible/pull/4624

Comment 10 Scott Dodson 2017-08-24 19:45:35 UTC

Timeout has been increased in master.

Comment 11 Anping Li 2017-08-30 02:37:55 UTC

The upgrade works well with the openshift-ansible master branch.

Comment 14 errata-xmlrpc 2017-11-28 21:56:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188