Bug 1697301 - cluster upgrade fails on timeout after 30 minutes
Summary: cluster upgrade fails on timeout after 30 minutes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Frontend.Core
Version: 4.3.2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.4
Target Release: 4.3.4.1
Assignee: Sharon Gratch
QA Contact: Petr Kubica
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-08 10:30 UTC by Petr Kubica
Modified: 2019-06-11 06:24 UTC (History)

Fixed In Version: ovirt-engine-4.3.4.1, ovirt-engine-ui-extensions-1.0.5-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-11 06:24:11 UTC
oVirt Team: UX
Embargoed:
pm-rhel: ovirt-4.3+
mperina: blocker?
lleistne: testing_ack+


Attachments
cluster-upgrade log (2.76 MB, text/plain)
2019-04-08 10:30 UTC, Petr Kubica
no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 99397 0 master MERGED ansible: Add execution_timeout variable 2020-09-28 21:57:17 UTC
oVirt gerrit 99618 0 ovirt-engine-4.3 MERGED ansible: Add execution_timeout variable 2020-09-28 21:57:23 UTC
oVirt gerrit 99949 0 master MERGED Fixed cluster upgrade execution timeout 2020-09-28 21:57:17 UTC

Description Petr Kubica 2019-04-08 10:30:15 UTC
Created attachment 1553530
cluster-upgrade log

Description of problem:
Based on the observed behavior, the playbook with the cluster-upgrade role appears to be killed after 30 minutes (see engine.log and the attached log from the cluster-upgrade role).
I started the cluster upgrade from the UI with the default value of 60 minutes (this variable should apply per host).

The role upgraded 3 hosts and was then killed.

engine.log:
2019-04-08 11:15:35,782+02 INFO  [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (default task-4) [] Executing Ansible command:  /usr/bin/ansible-playbook --ssh-common-args=-F /var/lib/ovirt-engine/.ssh/config -v --private-key=/etc/pki/ovirt-engine/keys/engine_id_rsa --extra-vars=engine_insecure="true" --extra-vars=engine_url="https://brq-setup.rhev.lab.eng.brq.redhat.com:443/ovirt-engine/api" --extra-vars=engine_token="BdZKh1WGnyxLLkFFvY1REim3VjzMhRf4eYZfrbJWNBMrMvZSLMAyZsHX_UD-jF0w2gt7SBVQkfiGMhNI1Ms7ww" --extra-vars=@/tmp/ansible-variables2580215303052505607 /usr/share/ovirt-engine/playbooks/ovirt-cluster-upgrade.yml [Logfile: /var/log/ovirt-engine/ansible/ansible-20190408111535-ovirt-cluster-upgrade_yml.log]
...
2019-04-08 11:45:35,796+02 ERROR [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (default task-4) [] Ansible playbook execution failed: Timeout occurred while executing Ansible playbook.
2019-04-08 11:45:35,797+02 ERROR [org.ovirt.engine.core.services.AnsibleServlet] (default task-4) [] Error while executing ansible-playbook command.
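
The gap between the two engine.log entries above can be checked directly. The following is only a minimal sketch: the two timestamps are copied from the log lines, while the helper name `parse_engine_timestamp` and the padding of the `+02` offset to `+0200` are assumptions made for illustration (the padding only works for whole-hour offsets).

```python
from datetime import datetime

# Timestamps copied from the two engine.log lines quoted above.
STARTED = "2019-04-08 11:15:35,782+02"
FAILED = "2019-04-08 11:45:35,796+02"


def parse_engine_timestamp(stamp: str) -> datetime:
    """Parse an engine.log timestamp (hypothetical helper).

    The log writes the UTC offset as "+02"; it is padded to "+0200"
    here so that strptime's %z directive accepts it.
    """
    return datetime.strptime(stamp + "00", "%Y-%m-%d %H:%M:%S,%f%z")


elapsed = parse_engine_timestamp(FAILED) - parse_engine_timestamp(STARTED)
elapsed_minutes = elapsed.total_seconds() / 60
print(f"playbook ran for {elapsed_minutes:.2f} minutes before the timeout")
```

The result is almost exactly 30 minutes, which matches a fixed execution timeout rather than a failure inside the role itself.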

Version-Release number of selected component (if applicable):
ovirt-engine-ui-extensions-1.0.4-1.el7ev.noarch

How reproducible:
always

Steps to Reproduce:
1. Run cluster-upgrade from the UI on a large environment (multiple physical hosts, so that the cluster upgrade takes more than 30 minutes)

Actual results:
fails on timeout

Expected results:
The role has its own timeout; as long as Ansible is still running, the playbook shouldn't be killed.

Additional info:

Comment 1 Martin Perina 2019-04-11 12:26:14 UTC
(In reply to Petr Kubica from comment #0)

Ondro, where do we have this timeout? I think we only have the 60-minute timeout for upgrading a host, right? Or is there some other timeout?

Comment 2 Martin Perina 2019-04-11 12:30:43 UTC
(In reply to Martin Perina from comment #1)

Ahh, I found it: we have a 30-minute default timeout for playbook execution in the engine:

https://github.com/oVirt/ovirt-engine/blob/master/packaging/services/ovirt-engine/ovirt-engine.conf.in#L644

This is the option which kills the playbook, right?

Comment 3 Ondra Machacek 2019-04-12 10:23:43 UTC
Correct. We can simply override it. If the UI sends a meaningful timeout parameter for the specific playbook, we can pass it to the executor and override the default. I've sent a patch for it.
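
The override described here can be pictured roughly as follows. This is a hedged sketch and not the engine's actual code: the function name, the constant name, and the rule that "meaningful" means a positive number of minutes are all assumptions; only the 30-minute default comes from this thread.

```python
from typing import Optional

# 30 minutes is the engine default mentioned in comment 2; the constant
# name is hypothetical and not taken from ovirt-engine.
DEFAULT_PLAYBOOK_TIMEOUT_MINUTES = 30


def effective_timeout_minutes(
    requested: Optional[int],
    default: int = DEFAULT_PLAYBOOK_TIMEOUT_MINUTES,
) -> int:
    """Pick the timeout for one playbook run (hypothetical helper).

    A caller-supplied value wins when it is meaningful, assumed here
    to mean a positive number of minutes; otherwise the default applies.
    """
    if requested is not None and requested > 0:
        return requested
    return default


print(effective_timeout_minutes(None))  # no value sent: default applies
print(effective_timeout_minutes(180))   # value sent by the caller wins
```

Falling back to the default for missing or non-positive values keeps other playbooks, which send no timeout, on their existing behavior.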

Comment 4 Petr Kubica 2019-04-12 11:27:56 UTC
Just a note: the new timeout should be based on the number of hosts being upgraded multiplied by the per-host upgrade timeout (which is already provided by the user).
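
As a small worked example of that calculation (a sketch only: the function name and the figure of 5 hosts are made up for illustration, while 60 minutes is the UI default mentioned in the description):

```python
def cluster_upgrade_timeout_minutes(host_count: int, per_host_timeout_minutes: int) -> int:
    """Overall playbook timeout = number of hosts * per-host timeout.

    Hypothetical helper; it assumes hosts are upgraded one after another,
    so the worst case is every host using its full per-host timeout.
    """
    if host_count < 1 or per_host_timeout_minutes < 1:
        raise ValueError("host count and per-host timeout must be positive")
    return host_count * per_host_timeout_minutes


# Hypothetical cluster of 5 hosts with the 60-minute per-host default.
print(cluster_upgrade_timeout_minutes(5, 60))
```

With these example numbers the playbook would be allowed 300 minutes in total, instead of being cut off at the 30-minute engine default.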

Comment 6 Martin Perina 2019-04-30 09:53:18 UTC
Moving back to POST; we still need to modify the UI part.

Comment 7 Michal Skrivanek 2019-05-21 13:30:54 UTC
The fix is not in ovirt-engine, so just moving this back.

fixed by https://github.com/oVirt/ovirt-engine-ui-extensions/commit/5347430228c898c6683ff3ac83daba08a9b054cb

Comment 8 Petr Kubica 2019-06-03 13:22:22 UTC
Verified in
ovirt-engine-ui-extensions-1.0.5-1.el7ev.noarch
ovirt-engine-4.3.4.2-0.1.el7.noarch

Comment 9 Sandro Bonazzola 2019-06-11 06:24:11 UTC
This bugzilla is included in oVirt 4.3.4 release, published on June 11th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

