Bug 1684266 - Exporting OVA timed out leaving orphan volume [NEEDINFO]
Summary: Exporting OVA timed out leaving orphan volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.0
Assignee: Shmuel Melamud
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On: 1825638
Blocks:
 
Reported: 2019-02-28 20:24 UTC by Javier Coscia
Modified: 2020-11-11 09:58 UTC (History)
14 users

Fixed In Version: rhv-4.4.0-29
Doc Type: Bug Fix
Doc Text:
When a large disk is converted as part of a VM export to OVA, the conversion takes a long time. Previously, the SSH channel used by the export script timed out and closed due to the long period of inactivity, leaving an orphan volume. The current release fixes this issue: the export script now generates some traffic on the SSH channel during disk conversion to prevent the channel from being closed.
Clone Of:
Environment:
Last Closed: 2020-08-04 13:26:25 UTC
oVirt Team: Virt
Target Upstream Version:
rdlugyhe: needinfo? (smelamud)
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2020:3246 0 None None None 2020-08-04 13:26:51 UTC
oVirt gerrit 99807 0 None MERGED core: SSH channel traffic during image conversion 2021-02-01 13:26:26 UTC

Description Javier Coscia 2019-02-28 20:24:28 UTC
Description of problem:

Exporting a vDisk as an OVA on RHV that times out after the default 30-minute
Ansible timeout leaves an orphan logical volume consuming space in the
storage domain.

TeardownImageVDSCommand and DeleteImageGroupVDSCommand fail because the
volume is still `in use`.

Version-Release number of selected component (if applicable):

ovirt-engine-4.2.7.5-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64

How reproducible:

Whenever Ansible times out while exporting/importing an OVA image.

Steps to Reproduce:
1. Export a VM with a big disk as an OVA image; any size that exceeds the
   default 30-minute Ansible timeout will do the trick.

2. The task fails, and attempting to tear down the image and remove it also
   fails because the volume is still in use.

3. The volume remains in the storage domain with the remove_me tag, orphaned.


Actual results:

The volume remains in the storage domain without being used.

Expected results:

The volume should be removed.

Additional info:

Will attach engine.log, vdsm.log and ansible.log.

Comment 4 Tal Nisan 2019-03-04 09:26:20 UTC
Arik, can you please have a look?

Comment 5 Ryan Barry 2019-04-15 13:50:15 UTC
Javier, do we know if the VM was running? It's not clear from the logs (they don't start early enough).

Comment 8 Ryan Barry 2019-04-22 13:25:38 UTC
So, we can increase the timeout here, but it's not a guarantee that the guest disks are unlocked. Tal - since the VM is down, we'll need help from storage to investigate why the LVs were locked. Any ideas?

Comment 9 Shmuel Melamud 2019-04-22 13:33:47 UTC
(In reply to Ryan Barry from comment #8)
> So, we can increase the timeout here, but it's not a guarantee that the
> guest disks are unlocked. Tal - since the VM is down, we'll need help from
> storage to investigate why the LVs were locked. Any ideas?

I suppose that unlocking the LV failed because the qemu-img process was still running and using the disk. Only the SSH connection was closed by the timeout.

Comment 10 Ryan Barry 2019-04-22 13:36:49 UTC
I thought this also, but comment #6 looks like it's down (I'm waiting to get the rest of the engine log to see if something else brought it back up).

Comment 11 Shmuel Melamud 2019-04-22 13:43:35 UTC
(In reply to Ryan Barry from comment #10)
> I thought this also, but comment#6 looks like it's down (I'm waiting to get
> the rest of the engine log to see if something else brought it back up)

2019-02-08 is a week before the issue. Why do we need to take this into account? The engine.log in attachment 1539639 [details] starts from 2019-02-09, and the Engine is running at that moment.

Comment 12 Ryan Barry 2019-04-22 13:46:27 UTC
That's why you're on the bug ;)

Because I didn't find entries in the current engine log to indicate whether the VM being exported was up or down. If it was down a week beforehand, the qemu-img process shouldn't still be running and using the disk, unless I've missed something in the logs.

Comment 13 Shmuel Melamud 2019-04-22 14:02:48 UTC
(In reply to Ryan Barry from comment #12)
> Because I didn't find entries in the current engine log to indicate whether
> the VM which was being exported was up or down. If it was down a week
> beforehand, the qemu-img process shouldn't still be running and using the
> disk unless I've missed something in the logs

The qemu-img process is not related to the VM running. During the export, the pack_ova.py script executes qemu-img to convert the disks.
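For context, the conversion boils down to a blocking subprocess call. This is only an illustrative sketch; the exact command line pack_ova.py builds (format, options, paths) is an assumption here:

```python
import subprocess

# Illustrative sketch; the exact pack_ova.py invocation is an
# assumption (format, options and paths are made up here).
def convert_disk(src_path, dst_path):
    # qemu-img convert can run for a long time on a large disk and
    # prints nothing while it works, so an SSH session that carries
    # this script's output sees no traffic until the copy finishes.
    subprocess.check_call(
        ["qemu-img", "convert", "-O", "qcow2", src_path, dst_path]
    )
```

Because the call blocks with no output, the SSH channel driving the script sits idle for the whole conversion, which is what trips the inactivity timeout.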

Comment 14 Fred Rolland 2019-04-29 13:11:57 UTC
Shmuel, what are the next steps?

Comment 15 Shmuel Melamud 2019-04-29 14:50:43 UTC
I am currently looking for an easy way to populate the SSH channel with some traffic during the conversion process, while verifying that it does not break the functionality. I'll post the patch soon.
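The idea described here can be sketched roughly as follows: run the conversion as a subprocess and have a helper thread write a byte of output at intervals, so the SSH channel carrying the script's output never looks idle. This is a minimal sketch of the approach, not the actual patch (gerrit 99807); `run_with_keepalive` and its interval are made-up names:

```python
import subprocess
import sys
import threading

def run_with_keepalive(cmd, interval=300):
    """Run a long-running command while periodically writing to stdout,
    so an SSH channel carrying this script's output sees traffic and is
    not closed as idle. (Sketch; names and interval are illustrative.)"""
    proc = subprocess.Popen(cmd)
    done = threading.Event()

    def keepalive():
        # Event.wait() returns False on timeout, True once set;
        # each timeout emits one byte of traffic on stdout.
        while not done.wait(interval):
            sys.stdout.write(".")
            sys.stdout.flush()

    t = threading.Thread(target=keepalive, daemon=True)
    t.start()
    rc = proc.wait()   # block until the conversion finishes
    done.set()         # stop the keepalive thread
    t.join()
    return rc
```

For example, `run_with_keepalive(["qemu-img", "convert", ...])` would emit a dot every few minutes for the duration of the conversion, which is enough to reset the SSH session's inactivity timer.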

Comment 16 Tal Nisan 2019-04-30 05:55:17 UTC
Moving to Virt since you are looking into the solution

Comment 17 Sandro Bonazzola 2019-06-26 14:27:30 UTC
Since we have a patch, can you please target this bug? 4.4? 4.3.5?

Comment 18 Daniel Gur 2019-08-28 13:14:36 UTC
sync2jira

Comment 19 Daniel Gur 2019-08-28 13:19:38 UTC
sync2jira

Comment 25 RHV bug bot 2019-12-13 13:16:49 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 26 RHV bug bot 2019-12-20 17:46:11 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 27 RHV bug bot 2020-01-08 14:50:30 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 28 RHV bug bot 2020-01-08 15:18:48 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 29 RHV bug bot 2020-01-24 19:52:11 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 31 Nisim Simsolo 2020-06-03 08:23:06 UTC
Verified:
ovirt-engine-4.4.1.1-0.5.el8ev.noarch
vdsm-4.40.17-1.el8ev.x86_64
libvirt-daemon-6.0.0-22.module+el8.2.1+6815+1c792dc8.x86_64
qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

Verification scenario:
1. Create a VM with a 200GB preallocated NFS disk. Install the latest RHEL 8 OS on it and verify the OS runs properly.
2. Export the OVA to a host NFS mount (use NFS so the export takes longer).
   Verify the OVA exported successfully and took more than 30 minutes (it took 43:51 minutes).
3. Import the exported OVA (import time was 2:26 minutes).
4. Run the imported VM.
   Verify the OS runs properly and the disk size is 200GB with a thin-provision allocation policy.

Comment 37 errata-xmlrpc 2020-08-04 13:26:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

