Bug 1697301

Summary: cluster upgrade fails on timeout after 30 minutes
Product: [oVirt] ovirt-engine
Component: Frontend.Core
Version: 4.3.2.1
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Petr Kubica <pkubica>
Assignee: Sharon Gratch <sgratch>
QA Contact: Petr Kubica <pkubica>
Docs Contact:
CC: bugs, lleistne, michal.skrivanek, mperina, omachace, sgratch
Target Milestone: ovirt-4.3.4
Target Release: 4.3.4.1
Flags: pm-rhel: ovirt-4.3+, mperina: blocker?, lleistne: testing_ack+
Whiteboard:
Fixed In Version: ovirt-engine-4.3.4.1, ovirt-engine-ui-extensions-1.0.5-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-11 06:24:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: UX
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: cluster-upgrade log (flags: none)

Description Petr Kubica 2019-04-08 10:30:15 UTC
Created attachment 1553530 [details]
cluster-upgrade log

Description of problem:
Based on the observed behavior, the playbook running the cluster-upgrade role appears to be killed after 30 minutes (see engine.log below and the attached log from the cluster-upgrade role).
I started the cluster upgrade from the UI with the default timeout of 60 minutes (this value should apply per host).

The role upgraded 3 hosts and was then killed.

engine.log:
2019-04-08 11:15:35,782+02 INFO  [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (default task-4) [] Executing Ansible command:  /usr/bin/ansible-playbook --ssh-common-args=-F /var/lib/ovirt-engine/.ssh/config -v --private-key=/etc/pki/ovirt-engine/keys/engine_id_rsa --extra-vars=engine_insecure="true" --extra-vars=engine_url="https://brq-setup.rhev.lab.eng.brq.redhat.com:443/ovirt-engine/api" --extra-vars=engine_token="BdZKh1WGnyxLLkFFvY1REim3VjzMhRf4eYZfrbJWNBMrMvZSLMAyZsHX_UD-jF0w2gt7SBVQkfiGMhNI1Ms7ww" --extra-vars=@/tmp/ansible-variables2580215303052505607 /usr/share/ovirt-engine/playbooks/ovirt-cluster-upgrade.yml [Logfile: /var/log/ovirt-engine/ansible/ansible-20190408111535-ovirt-cluster-upgrade_yml.log]
...
2019-04-08 11:45:35,796+02 ERROR [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (default task-4) [] Ansible playbook execution failed: Timeout occurred while executing Ansible playbook.
2019-04-08 11:45:35,797+02 ERROR [org.ovirt.engine.core.services.AnsibleServlet] (default task-4) [] Error while executing ansible-playbook command.

Version-Release number of selected component (if applicable):
ovirt-engine-ui-extensions-1.0.4-1.el7ev.noarch

How reproducible:
always

Steps to Reproduce:
1. Run cluster-upgrade from the UI in a large environment (multiple physical hosts, so that the cluster upgrade takes more than 30 minutes)

Actual results:
fails on timeout

Expected results:
The role has its own timeout; as long as Ansible is still running, the playbook shouldn't be killed.

Additional info:

Comment 1 Martin Perina 2019-04-11 12:26:14 UTC
(In reply to Petr Kubica from comment #0)

Ondro, where do we have this timeout? I think we only have a 60-minute timeout for upgrading a host, right? Or is there some other timeout?

Comment 2 Martin Perina 2019-04-11 12:30:43 UTC
(In reply to Martin Perina from comment #1)

Ahh, I found it: we have a 30-minute default timeout for playbook execution in the engine:

https://github.com/oVirt/ovirt-engine/blob/master/packaging/services/ovirt-engine/ovirt-engine.conf.in#L644

This is the option which kills the playbook, right?

Comment 3 Ondra Machacek 2019-04-12 10:23:43 UTC
Correct. We can simply override it. If the UI sends a meaningful timeout parameter for the specific playbook, we can pass it to the executor and override the default. I've sent a patch for it.
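
The override described here could look roughly like the sketch below. This is only an illustration, not the actual engine patch: the function name `effective_timeout_minutes` and the constant `DEFAULT_PLAYBOOK_TIMEOUT_MINUTES` are hypothetical, and the 30-minute value is taken from the default mentioned in comment 2.

```python
from typing import Optional

# Assumed fallback, mirroring the 30-minute engine default mentioned in comment 2.
DEFAULT_PLAYBOOK_TIMEOUT_MINUTES = 30


def effective_timeout_minutes(requested: Optional[int]) -> int:
    """Pick the timeout to use for one playbook run.

    A caller-supplied value wins when it is meaningful (a positive number);
    otherwise the default applies. Both names are hypothetical.
    """
    if requested is not None and requested > 0:
        return requested
    return DEFAULT_PLAYBOOK_TIMEOUT_MINUTES
```

With this shape, a request that carries no timeout keeps today's behavior, while a cluster upgrade can ask for a longer limit.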

Comment 4 Petr Kubica 2019-04-12 11:27:56 UTC
Just a note: the new timeout should be the number of hosts being upgraded multiplied by the per-host upgrade timeout (which is already provided by the user).
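
As a rough sketch of that calculation, assuming hosts are upgraded one after another so the per-host timeouts add up (the function and parameter names below are hypothetical and not taken from the UI code):

```python
def cluster_upgrade_timeout_minutes(host_count: int, per_host_timeout_minutes: int) -> int:
    """Overall playbook timeout: number of hosts times the per-host timeout.

    Assumes sequential host upgrades; names are illustrative only.
    """
    if host_count < 0:
        raise ValueError("host_count must not be negative")
    if per_host_timeout_minutes <= 0:
        raise ValueError("per_host_timeout_minutes must be positive")
    return host_count * per_host_timeout_minutes
```

For example, 5 hosts with the default 60-minute per-host timeout would give an overall limit of 300 minutes, well above the 30-minute engine default that killed the run in this report.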

Comment 6 Martin Perina 2019-04-30 09:53:18 UTC
Moving back to POST; we still need to modify the UI part.

Comment 7 Michal Skrivanek 2019-05-21 13:30:54 UTC
The fix is not in ovirt-engine, so just moving this back.

Fixed by https://github.com/oVirt/ovirt-engine-ui-extensions/commit/5347430228c898c6683ff3ac83daba08a9b054cb

Comment 8 Petr Kubica 2019-06-03 13:22:22 UTC
Verified in
ovirt-engine-ui-extensions-1.0.5-1.el7ev.noarch
ovirt-engine-4.3.4.2-0.1.el7.noarch

Comment 9 Sandro Bonazzola 2019-06-11 06:24:11 UTC
This bugzilla is included in the oVirt 4.3.4 release, published on June 11th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.