Description of problem:

Due to known slowness in rebooting nodes (see https://bugzilla.redhat.com/show_bug.cgi?id=1441994), the playbook-based upgrades (see https://docs.openshift.com/container-platform/latest/install_config/upgrading/automated_upgrades.html#running-the-upgrade-playbook-directly) occasionally fail: a node reboot may take something like 3-6 minutes, but the upgrade playbooks have a timeout of 300 s (5 min) for node reboots. If bug 1441994 can't be fixed soon, then please increase the timeout in the playbook to allow more time for slow-booting nodes.

Version-Release number of selected component (if applicable):
3.4.1
Please always include the exact version of the playbooks and the actual error message. I ask this because '3.4.1' is not an openshift-ansible version but an OCP version, and we've shipped 6 different versions of the playbooks for the OCP 3.4 channels.

If you've installed the playbooks via RPM:
  rpm -q openshift-ansible
or, if you're working from a git checkout:
  git describe

I've not looked at all 3.4 versions, but the current version would poll indefinitely, every 10 seconds, while waiting for a host to come back.
This has been seen with several versions coming from the OCP 3.4 channels, the one currently in use being:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch

Which has:

- name: Wait for master to restart
  local_action:
    module: wait_for host="{{ inventory_hostname }}" state=started delay=10 timeout=300
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    timeout: 300
    port: "{{ openshift.master.api_port }}"

Thanks.
Can you please verify your RPMs? I've installed that RPM locally and that code is not present, nor is it present in the GitHub tag:

https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.4.89-1/playbooks/common/openshift-master/restart_hosts.yml#L10-L16
Sorry for being unclear. Here is a verified example:

# rpm -qf /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# rpm -V openshift-ansible-playbooks-3.4.89-1.git.0.ac29ce8.el7.noarch
# cat /usr/share/ansible/openshift-ansible/playbooks/common/openshift-master/restart_hosts.yml
---
- name: Restart master system
  # https://github.com/ansible/ansible/issues/10616
  shell: sleep 2 && shutdown -r now "OpenShift Ansible master rolling restart"
  async: 1
  poll: 0
  ignore_errors: true
  become: yes

- name: Wait for master to restart
  local_action:
    module: wait_for host="{{ inventory_hostname }}" state=started delay=10
  become: no

# Now that ssh is back up we can wait for API on the remote system,
# avoiding some potential connection issues from local system:
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"
    state: started
    delay: 10
    port: "{{ openshift.master.api_port }}"

So the "timeout" parameter is not set explicitly in this version; both tasks fall back to the wait_for module's default timeout of 300 seconds [1]. After some local experiments, setting the timeout to 600 on the "Wait for master to restart" task fixes the issue.

1) https://docs.ansible.com/ansible/wait_for_module.html

Thanks.
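For illustration, a minimal sketch of the workaround described above: editing the "Wait for master to restart" task to pass an explicit timeout to wait_for. The value 600 is the one found to work in local experiments here, not necessarily what upstream will ship:

```yaml
# playbooks/common/openshift-master/restart_hosts.yml (locally patched sketch)
- name: Wait for master to restart
  local_action:
    # timeout=600 overrides the wait_for module default of 300 seconds,
    # giving slow-booting nodes up to 10 minutes to come back.
    module: wait_for host="{{ inventory_hostname }}" state=started delay=10 timeout=600
  become: no
```

The same explicit timeout could be applied to the "Wait for master API to come back online" task if the API also takes long to return.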
I'd rather we spend time making the upgrade process pick up where it left off than fighting a battle of tuning timeouts for every possible situation. If more customers report this I'm not opposed to the "double the timeout" bandaid.
(In reply to Brenton Leanhardt from comment #5)
> I'd rather we spend time making the upgrade process pick up where it left
> off than fighting a battle of tuning timeouts for every possible situation.

Note that here we are talking about a known bug which currently prevents upgrades altogether unless the playbooks are manually edited. Fixing the known issue would of course be the best option, but there has been no movement on that front. Increasing the timeout would be a trivial bandaid to avoid this and allow upgrading. Improving the playbooks to continue where they left off even in case of bugs sounds like a welcome improvement, but also like a non-trivial effort.
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4624
Timeout has been increased in master.
The upgrade works well with the openshift-ansible master branch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188