Bug 2030293 - VM in locked state forever if manager is rebooted while exporting VM as OVA
Summary: VM in locked state forever if manager is rebooted while exporting VM as OVA
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.8
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.1
Target Release: ---
Assignee: Liran Rotenberg
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-08 11:50 UTC by nijin ashok
Modified: 2022-07-14 12:55 UTC
CC List: 3 users

Fixed In Version: ovirt-engine-4.5.1.2
Doc Type: Bug Fix
Doc Text:
A VM no longer remains in a permanent locked state if the Manager is rebooted while exporting the VM as OVA.
Clone Of:
Environment:
Last Closed: 2022-07-14 12:54:31 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44166 0 None None None 2021-12-08 11:51:22 UTC
Red Hat Product Errata RHSA-2022:5555 0 None None None 2022-07-14 12:55:15 UTC

Description nijin ashok 2021-12-08 11:50:01 UTC
Description of problem:

The Manager is rebooted while Ansible is running the ovirt-ova-pack role.

~~~
2021-12-08 16:53:20 IST - TASK [ovirt-ova-pack : Run packing script] *************************************
~~~

When the engine comes back up, it tries to check the status of the export.

~~~
2021-12-08 16:56:19,067+05 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 61) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'. >>> engine start

2021-12-08 16:56:42,509+05 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-57) [a9f35684-8b11-4dc9-804c-36db831c19a9] Command 'ExportVmToOva' (id: '9be2b484-1130-4c88-87d4-13453d8a6041') waiting on child command id: '3539c2f4-7f2b-4bf9-b3e3-18df1f0e4fa7' type:'CreateOva' to complete
2021-12-08 16:56:42,558+05 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-57) [a9f35684-8b11-4dc9-804c-36db831c19a9] Command 'CreateOva' (id: '3539c2f4-7f2b-4bf9-b3e3-18df1f0e4fa7') waiting on child command id: '8c7ae6d7-edb6-47f7-baee-27218e8b8881' type:'AnsiblePackOva' to complete
~~~

The run is not in the Ansible runner cache, so the status request fails with the error "found an incomplete artifacts directory...Possible ansible_runner error". The service checks the "status" file under the artifacts directory.

~~~
2021-12-08 17:02:03,912 - runner_service.controllers.playbooks - INFO - 127.0.0.1 - GET /api/v1/playbooks/363026fa-5819-11ec-be69-002b6a01557c
2021-12-08 17:02:03,913 - runner_service.services.playbook - DEBUG - runner_cache 'miss' for run 363026fa-5819-11ec-be69-002b6a01557c
2021-12-08 17:02:03,913 - runner_service.services.playbook - WARNING - Status Request for Play uuid '363026fa-5819-11ec-be69-002b6a01557c', found an incomplete artifacts directory...Possible ansible_runner  error?
~~~

The ansible-runner status file is only created when the task fails or succeeds, so at this point the status file does not exist yet.

~~~
# ls -lt /usr/share/ovirt-engine/ansible-runner-service-project/artifacts/363026fa-5819-11ec-be69-002b6a01557c
total 12
drwx------. 2 ovirt ovirt 4096 Dec  8 16:53 job_events
-rw-------. 1 ovirt ovirt 1323 Dec  8 16:53 stdout
drwxr-xr-x. 2 ovirt ovirt   47 Dec  8 16:53 fact_cache
-rw-------. 1 ovirt ovirt 1368 Dec  8 16:53 command
drwx------. 2 ovirt ovirt   54 Dec  8 16:53 env


# tail /usr/share/ovirt-engine/ansible-runner-service-project/artifacts/363026fa-5819-11ec-be69-002b6a01557c/stdout 
TASK [ovirt-ova-export-pre-pack : Removing the temporary file] *****************
changed: [dell-r530-3.gsslab.pnq.redhat.com]

TASK [ovirt-ova-export-pre-pack : Allocating the temporary path for the OVA file] ***
changed: [dell-r530-3.gsslab.pnq.redhat.com]

TASK [ovirt-ova-export-pre-pack : Retrieving the temporary path for the OVA file] ***
changed: [dell-r530-3.gsslab.pnq.redhat.com]

TASK [ovirt-ova-pack : Run packing script] *************************************
~~~
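For illustration, here is a minimal Python sketch of the kind of check runner-service performs against the artifacts directory (the function name and exact behavior are assumptions, not the real runner_service code):

```python
import os
import tempfile

def playbook_status(artifacts_dir):
    """Return the recorded status of a run, or 'unknown' when the
    artifacts directory exists but contains no status file yet
    (i.e. the play never reached a terminal state)."""
    status_file = os.path.join(artifacts_dir, "status")
    if os.path.isfile(status_file):
        with open(status_file) as f:
            return f.read().strip()
    return "unknown"

# Demo: an artifacts dir without a status file reports 'unknown',
# mirroring the "incomplete artifacts directory" warning above.
run_dir = tempfile.mkdtemp()
print(playbook_status(run_dir))  # unknown

# Once a terminal status is written, the check succeeds.
with open(os.path.join(run_dir, "status"), "w") as f:
    f.write("failed")
print(playbook_status(run_dir))  # failed
```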

Although the packing script finishes on the host, the engine has lost the status of the Ansible task and waits indefinitely, keeping the VM locked.

We were only able to unlock the VM after manually creating a status file with content "failed".
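The manual workaround can be sketched as follows (the helper name is hypothetical; in the real case artifacts_root is the ansible-runner-service-project artifacts directory and run_uuid is the play UUID from the engine log, and the demo below uses a throwaway directory instead):

```python
import os
import tempfile

def mark_run_failed(artifacts_root, run_uuid):
    """Manual workaround, not a fix: write a 'failed' status file so
    the engine's next status poll sees a terminal state and releases
    the VM lock."""
    status_path = os.path.join(artifacts_root, run_uuid, "status")
    with open(status_path, "w") as f:
        f.write("failed")
    return status_path

# Demo against a temporary directory standing in for
# /usr/share/ovirt-engine/ansible-runner-service-project/artifacts.
root = tempfile.mkdtemp()
run_uuid = "363026fa-5819-11ec-be69-002b6a01557c"
os.makedirs(os.path.join(root, run_uuid))
path = mark_run_failed(root, run_uuid)
print(open(path).read())  # failed
```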


Version-Release number of selected component (if applicable):

rhvm-4.4.8.6-0.1.el8ev.noarch

How reproducible:

100%

Steps to Reproduce:

1. Export a VM as OVA into the host filesystem. Use a VM with a 10+ GB disk so that the "packing script" takes some time to run.

2. Watch the logs under "/var/log/ovirt-engine/ova/" and reboot the Manager while the ovirt-ova-pack role is running.

3. After the Manager reboots, the VM stays locked forever even after pack_ova.py completes on the host (watch the ps output on the host).

Actual results:

VM in locked state forever if manager is rebooted while exporting VM as OVA

Expected results:

The VM should be unlocked and the operation marked as failed.

Additional info:

Comment 2 Liran Rotenberg 2021-12-14 16:18:19 UTC
The problem is that ansible-runner-service is missing the artifact:
2021-12-08 16:53:20,021 - runner_service.controllers.playbooks - INFO - 127.0.0.1 - GET /api/v1/playbooks/363026fa-5819-11ec-be69-002b6a01557c
2021-12-08 16:53:20,021 - runner_service.services.playbook - DEBUG - runner_cache 'hit' for playbook status request
2021-12-08 16:53:20,023 - runner_service.controllers.jobs - DEBUG - Request received, content-type :None
...
2021-12-08 16:53:20,064 - runner_service.controllers.jobs - INFO - 127.0.0.1 - GET /api/v1/jobs/363026fa-5819-11ec-be69-002b6a01557c/events/30-002b6a01-557c-a4a3-25e5-00000000001e
2021-12-08 16:53:30,082 - runner_service.controllers.playbooks - DEBUG - Request received, content-type :None
2021-12-08 16:53:30,083 - runner_service.controllers.playbooks - INFO - 127.0.0.1 - GET /api/v1/playbooks/363026fa-5819-11ec-be69-002b6a01557c
2021-12-08 16:53:30,083 - runner_service.services.playbook - DEBUG - runner_cache 'hit' for playbook status request
2021-12-08 16:53:30,085 - runner_service.controllers.jobs - DEBUG - Request received, content-type :None
2021-12-08 16:53:30,085 - runner_service.controllers.jobs - INFO - 127.0.0.1 - GET /api/v1/jobs/363026fa-5819-11ec-be69-002b6a01557c/events
2021-12-08 16:53:30,085 - runner_service.services.jobs - DEBUG - Job events for play 363026fa-5819-11ec-be69-002b6a01557c: 31
2021-12-08 16:53:30,085 - runner_service.services.jobs - DEBUG - Active filter is :{}
2021-12-08 16:53:30,087 - runner_service.controllers.jobs - DEBUG - Request received, content-type :None
2021-12-08 16:53:30,087 - runner_service.controllers.jobs - INFO - 127.0.0.1 - GET /api/v1/jobs/363026fa-5819-11ec-be69-002b6a01557c/events
2021-12-08 16:53:30,087 - runner_service.services.jobs - DEBUG - Job events for play 363026fa-5819-11ec-be69-002b6a01557c: 31
2021-12-08 16:53:30,087 - runner_service.services.jobs - DEBUG - Active filter is :{}
2021-12-08 16:56:28,306 - root - INFO - Analysing local configuration options from /etc/ansible-runner-service/config.yaml
2021-12-08 16:56:28,313 - root - INFO - - setting playbooks_root_dir to /usr/share/ovirt-engine/ansible-runner-service-project
2021-12-08 16:56:28,313 - root - INFO - - setting ssh_private_key to /etc/pki/ovirt-engine/keys/engine_id_rsa
2021-12-08 16:56:28,313 - root - INFO - - setting port to 50001
2021-12-08 16:56:28,313 - root - INFO - - setting target_user to root
2021-12-08 16:56:28,314 - root - INFO - - setting log_path to /var/log/ovirt-engine
2021-12-08 16:56:28,314 - root - INFO - - setting ssh_checks to False
2021-12-08 16:56:28,314 - root - INFO - Analysing runtime overrides from environment variables
2021-12-08 16:56:28,314 - root - INFO - No configuration settings overridden
2021-12-08 16:56:28,316 - root - INFO - Loaded logging configuration from /etc/ansible-runner-service/logging.yaml
2021-12-08 16:56:28,330 - runner_service.controllers.playbooks - DEBUG - Request received, content-type :None
2021-12-08 16:56:28,331 - runner_service.controllers.playbooks - INFO - 127.0.0.1 - GET /api/v1/playbooks/363026fa-5819-11ec-be69-002b6a01557c
2021-12-08 16:56:28,331 - runner_service.services.playbook - DEBUG - runner_cache 'miss' for run 363026fa-5819-11ec-be69-002b6a01557c
2021-12-08 16:56:28,333 - runner_service.services.playbook - WARNING - Status Request for Play uuid '363026fa-5819-11ec-be69-002b6a01557c', found an incomplete artifacts directory...Possible ansible_runner  error?
2021-12-08 16:56:28,344 - runner_service.controllers.jobs - DEBUG - Request received, content-type :None
2021-12-08 16:56:28,344 - runner_service.controllers.jobs - INFO - 127.0.0.1 - GET /api/v1/jobs/363026fa-5819-11ec-be69-002b6a01557c/events
2021-12-08 16:56:28,361 - runner_service.services.jobs - DEBUG - Job events for play 363026fa-5819-11ec-be69-002b6a01557c: 31
2021-12-08 16:56:28,361 - runner_service.services.jobs - DEBUG - Active filter is :{}


Here we see that ansible-runner-service restarted and then suddenly failed to find the artifact for the UUID. In that case the response to the engine is "unknown" and we keep polling the job.

We need to figure out what happened to the artifact, if we can, and consider failing or succeeding the job when we get an "unknown" response.
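The suggestion above, failing the job on a sustained "unknown" response instead of polling forever, could look like this sketch (the function, threshold, and status strings are hypothetical, not actual ovirt-engine code):

```python
def poll_until_terminal(get_status, max_unknown=3):
    """Poll a status callback until a terminal state is reached.
    A run of consecutive 'unknown' responses suggests the artifacts
    were lost (e.g. after a runner-service restart), so fail the job
    instead of waiting indefinitely."""
    unknown_streak = 0
    while True:
        status = get_status()
        if status in ("successful", "failed"):
            return status
        if status == "unknown":
            unknown_streak += 1
            if unknown_streak >= max_unknown:
                return "failed"  # give up: treat lost artifacts as failure
        else:
            unknown_streak = 0  # a normal 'running' response resets the streak

# A runner-service that lost its cache answers 'unknown' forever; the
# current engine would poll indefinitely, this sketch fails after 3 tries.
print(poll_until_terminal(lambda: "unknown"))  # failed
```

Failing the job would release the VM lock, which matches the expected result above; the right threshold and whether some runs could still be recovered are open questions.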

Comment 3 Arik 2022-06-15 15:30:59 UTC
Couldn't reproduce on the latest version; it was likely fixed by other changes to ansible-runner.

Comment 6 Nisim Simsolo 2022-06-16 14:48:35 UTC
Verified (didn't manage to reproduce):
ovirt-engine-4.5.1.1-0.14.el8ev
vdsm-4.50.1.2-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+15489+bc23efef.1.x86_64
libvirt-8.0.0-5.2.module+el8.6.0+15256+3a0914fe.x86_64

Comment 10 errata-xmlrpc 2022-07-14 12:54:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.1] security, bug fix and update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5555

