Bug 1455836
| Summary: | Upgrades fail due to slow reboots causing timeouts | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Marko Myllynen <myllynen> |
| Component: | Cluster Version Operator | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 3.4.1 | CC: | anli, aos-bugs, bleanhar, jokerman, jswensso, mmccomas, myllynen |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Upgrades that made use of system reboots to restart services may have failed when hosts took longer than 5 minutes to restart. The timeout has been increased to 10 minutes. If a host takes longer than 10 minutes, it is likely a problem that the admin needs to investigate. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-28 21:56:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marko Myllynen
2017-05-26 09:20:01 UTC
Please always include the exact version of the playbooks and the actual error message. I ask this because '3.4.1' is not an openshift-ansible version but an OCP version, and we've shipped 6 different versions of the playbooks for the OCP 3.4 channels. If you've installed the playbooks via RPM:

```
rpm -q openshift-ansible
```

or if you're working from a git checkout:

```
git describe
```

I've not looked at all 3.4 versions, but the current version would poll indefinitely every 10 seconds while waiting for a host to come back.

---

This has been seen with several versions coming from the OCP 3.4 channels, the one currently in use being:

```
# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
```

Which has:

```yaml
- name: Wait for master to restart
  local_action:
    module: wait_for host="{{ inventory_hostname }}" state=started delay=10 timeout=300
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    timeout: 300
    port: "{{ openshift.master.api_port }}"
```

---

Thanks. Can you please verify your RPMs? I've installed that RPM locally and that code is not present, nor is it present in the GitHub tag.

https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.4.89-1/playbooks/common/openshift-master/restart_hosts.yml#L10-L16

---

Sorry for being unclear.
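For context on the timeout dispute in this thread: the snippets quoted in the comments differ only in whether `timeout` is written explicitly, and per the Ansible documentation cited below, `wait_for`'s `timeout` defaults to 300 seconds, so the two forms behave identically. A minimal illustrative sketch (the task names are placeholders, not from the playbooks):

```yaml
# Equivalent forms: wait_for's timeout defaults to 300 seconds,
# so omitting the parameter does NOT disable the limit.
- name: Wait for host (explicit timeout)
  wait_for:
    host: "{{ inventory_hostname }}"
    state: started
    delay: 10
    timeout: 300

- name: Wait for host (implicit default timeout, same behavior)
  wait_for:
    host: "{{ inventory_hostname }}"
    state: started
    delay: 10
```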
Here is a verified example:

```
# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# rpm -V openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# cat /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
---
- name: Restart master system
  # https://github.com/ansible/ansible/issues/10616
  shell: sleep 2 && shutdown -r now "OpenShift Ansible master rolling restart"
  async: 1
  poll: 0
  ignore_errors: true
  become: yes

- name: Wait for master to restart
  local_action:
    module: wait_for host="{{ inventory_hostname }}" state=started delay=10
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"
```

So the "timeout" parameter was extraneous in my earlier paste. After some local experiments, setting it to 600 here fixes the issue. But as seen above, here too the default timeout of 300 seconds [1] is in use.

1) https://docs.ansible.com/ansible/wait_for_module.html

Thanks.

---

I'd rather we spend time making the upgrade process pick up where it left off than fighting a battle of tuning timeouts for every possible situation. If more customers report this, I'm not opposed to the "double the timeout" bandaid.

---

(In reply to Brenton Leanhardt from comment #5)
> I'd rather we spend time making the upgrade process pick up where it left
> off than fighting a battle of tuning timeouts for every possible situation.

Note that here we are talking about a known bug which currently prevents upgrades altogether unless the playbooks are manually edited. Fixing the known issue would of course be the best option, but there has been no movement on that front. Increasing the timeout would be a trivial bandaid to avoid this and allow upgrading. Improving the playbooks to continue where they left off even in case of bugs sounds like a welcome improvement, but also like a non-trivial effort.

---

Timeout has been increased in master.

---

The upgrade works well with the openshift-ansible:master branch.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
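Per the thread above, the fix merged into master raised the wait timeout, and the Doc Text gives the new value as 10 minutes (600 seconds). A hedged sketch of what the edited wait tasks could look like, assuming the playbook layout shown in the verified example (the exact upstream change may differ):

```yaml
# Sketch only: both wait tasks with the timeout raised from the
# 300-second default to 600 seconds, as discussed in the comments.
- name: Wait for master to restart
  local_action:
    module: wait_for host="{{ inventory_hostname }}" state=started delay=10 timeout=600
  become: no

- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    timeout: 600
    port: "{{ openshift.master.api_port }}"
```

If a host still takes longer than 10 minutes to come back, the Doc Text suggests treating that as a problem for the administrator to investigate rather than raising the timeout further.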