Bug 1975225 - Occasional failures to export VM to OVA
Summary: Occasional failures to export VM to OVA
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.8
Target Release: ---
Assignee: Liran Rotenberg
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-23 09:49 UTC by Arik
Modified: 2021-08-19 06:23 UTC
CC List: 3 users

Fixed In Version: ovirt-engine-4.4.8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-19 06:23:13 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 115379 0 master MERGED core: pack_ova: flush loop device 2021-06-24 07:03:28 UTC
oVirt gerrit 115657 0 master MERGED core: pack_ova: wait for loop device 2021-07-12 15:48:09 UTC

Description Arik 2021-06-23 09:49:33 UTC
From time to time exporting a VM to an OVA fails.
It seems that something happens right after reaching the step of invoking the packing script (pack_ova.py) that makes the engine think the operation failed.

It doesn't seem to be a problem with the host, though, since an attempt to export a template to OVA on that host succeeds right after.

Comment 4 Arik 2021-06-23 10:05:51 UTC
Filip, can you please also check why OST didn't fail on that and instead tried to import that OVA?

Comment 5 Filip Januška 2021-06-23 11:30:18 UTC
The way OST checks the result of this export doesn't seem very reliable. During the export, a temporary VM snapshot is created and then deleted after the export is complete. OST only checks that this snapshot is no longer present, which will be the case whether the export fails or succeeds. Perhaps we should check for the actual .ova file on the host?

Comment 6 Arik 2021-06-23 12:23:52 UTC
(In reply to Filip Januška from comment #5)
> The way OST checks the result of this export doesn't seem very reliable.
> During the export, a temporary VM snapshot is created and then deleted after
> the export is complete. OST only checks that this snapshot is no longer
> present, which will be the case whether the export fails or succeeds.
> Perhaps we should check for the actual .ova file on the host?

Ah yes, that takes me back to the time it was added...
The rationale was not to make any "heavy" call, but just to check whether the snapshot is still there in order to determine whether the export command is still executed.
Then, once we know the export command (which is executed asynchronously) has completed, we can check whether it succeeded or not.
I wouldn't change the way we check whether the export command has completed, and I wouldn't add another check for the existence of the OVA file (because that's what the import command does at the beginning), but I think we should check whether the export command succeeded by its execution job.
If we're able to say "export command ended with failure", it would be less confusing (initially I suspected that another job removed the OVA before we got to the import phase).
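A minimal sketch of what such a job-based check could look like, assuming the oVirt Python SDK (ovirtsdk4) and matching the export job by its description; the matching strings, function name, and parameters are illustrative and not taken from the actual OST or engine code:

# Illustrative sketch (not the actual OST code): instead of inferring success
# from the absence of the temporary snapshot, look up the engine job that ran
# the export and assert on its status.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

def export_job_failed(connection, vm_name):
    """Return True if the most recent OVA-export job mentioning vm_name failed."""
    jobs_service = connection.system_service().jobs_service()
    # Filtering by description is an assumption; the exact wording of the
    # job description may differ between engine versions.
    export_jobs = [
        job for job in jobs_service.list()
        if job.description and vm_name in job.description and 'OVA' in job.description
    ]
    if not export_jobs:
        return False  # no matching job found (yet)
    latest = max(export_jobs, key=lambda job: job.start_time)
    return latest.status == types.JobStatus.FAILED

# Example usage (connection details are placeholders):
# connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
#                             username='admin@internal', password='...',
#                             ca_file='ca.pem')
# assert not export_job_failed(connection, 'vm0')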

Comment 7 Arik 2021-06-23 12:24:50 UTC
still executed -> still executing

Comment 10 Arik 2021-07-04 13:35:34 UTC
This one is very difficult to reproduce; only regression testing by automation is required.

Comment 11 Qin Yuan 2021-07-07 08:24:03 UTC
Verified with:
ovirt-engine-4.4.7.6-0.11.el8ev.noarch

No regressions were found in the automation tests regarding exporting a VM to OVA.

Comment 12 Sandro Bonazzola 2021-07-08 14:15:34 UTC
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 13 Arik 2021-07-08 20:16:46 UTC
Happened again with the fix:
converting disk: /rhev/data-center/mnt/192.168.202.2:_exports_nfs_share2/49b69e88-f7a9-4daa-8add-69c07629c465/images/699a4e21-8eb5-4d54-a21b-fa1bb877bca9/7b1674d2-185e-4c2a-aeb7-1ceb4cc1c8d6, offset 19968
losetup: /var/tmp/ova_vm.ova.tmp: failed to set up loop device: Resource temporarily unavailable

And in 'messages':
Jul  7 11:31:06 lago-basic-suite-master-host-0 kernel: loop: module loaded
Jul  7 11:31:06 lago-basic-suite-master-host-0 kernel: loop0: detected capacity change from 0 to 414208
Jul  7 11:31:06 lago-basic-suite-master-host-0 kernel: loop_set_status: loop0 () has still dirty pages (nrpages=1)

Seems that others have faced this issue as well:
https://www.spinics.net/lists/kernel/msg3975499.html

Their proposed fix:
https://www.spinics.net/lists/kernel/msg3977449.html

If we can't ensure that there are no dirty pages at the time we set up the loopback device, we can take the same approach (since create-ova needs to work on clusters that won't get the fix [1]) in pack_ova.py, by identifying whether the error is "Resource temporarily unavailable" (which is EAGAIN) and retrying 64 times. We need to think of how to avoid iterating 64 times when losetup would also iterate 64 times internally, though.

[1] https://github.com/karelzak/util-linux/commit/3e03cb680668e4d47286bc7e6ab43e47bb84c989
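
For illustration, a minimal sketch of the retry approach described above, assuming pack_ova.py attaches the disk by invoking losetup through subprocess; the function name, retry count, and delay are illustrative, and this is not the actual gerrit patch:

# Illustrative sketch: retry losetup when the kernel still reports dirty pages,
# which losetup surfaces as EAGAIN / "Resource temporarily unavailable".
import subprocess
import time

def create_loop_device(ova_path, offset, retries=10, delay=1.0):
    """Attach ova_path at the given offset to a free loop device, retrying on EAGAIN."""
    for _ in range(retries):
        result = subprocess.run(
            ['losetup', '--find', '--show', '--offset', str(offset), ova_path],
            capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()  # e.g. /dev/loop0
        if 'Resource temporarily unavailable' not in result.stderr:
            raise RuntimeError(result.stderr)  # a different failure, don't retry
        time.sleep(delay)  # dirty pages were not written back yet; wait and retry
    raise RuntimeError('losetup kept failing with EAGAIN after %d attempts' % retries)

On hosts whose util-linux already contains the commit referenced in [1], losetup retries internally as well, so the outer retry count here could be kept small.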

Comment 15 Arik 2021-07-08 21:47:00 UTC
(In reply to Arik from comment #13)
> If we can't ensure that there are no dirty pages at the time we set up the
> loopback device, we can take the same approach (since create-ova needs to
> work on clusters that won't get the fix [1]) in pack_ova.py, by identifying
> whether the error is "Resource temporarily unavailable" (which is EAGAIN)
> and retrying 64 times. We need to think of how to avoid iterating 64 times
> when losetup would also iterate 64 times internally, though.

Of course, it doesn't have to be 64 times; we can retry every second or every few seconds, several times.
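
A sketch of that time-based variant, again illustrative rather than the merged patch: bound the retries by a total deadline with a pause of a second or two between attempts, which is one way to keep the overall wait reasonable even when losetup itself also retries internally:

# Illustrative: retry until a total deadline instead of a fixed iteration count.
import subprocess
import time

def create_loop_device_with_deadline(ova_path, offset, timeout=30.0, delay=2.0):
    """Attach ova_path to a free loop device, retrying on EAGAIN until timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = subprocess.run(
            ['losetup', '--find', '--show', '--offset', str(offset), ova_path],
            capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
        if ('Resource temporarily unavailable' not in result.stderr
                or time.monotonic() >= deadline):
            raise RuntimeError(result.stderr)
        time.sleep(delay)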

Comment 16 Qin Yuan 2021-08-13 05:05:17 UTC
Verified with:
ovirt-engine-4.4.8.3-0.10.el8ev.noarch

No regression issues were found.

Comment 17 Sandro Bonazzola 2021-08-19 06:23:13 UTC
This bugzilla is included in oVirt 4.4.8 release, published on August 19th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.8 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

