Bug 1975225

Summary: Occasional failures to export VM to OVA
Product: [oVirt] ovirt-engine
Reporter: Arik <ahadas>
Component: BLL.Virt
Assignee: Liran Rotenberg <lrotenbe>
Status: CLOSED CURRENTRELEASE
QA Contact: Qin Yuan <qiyuan>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.4.7
CC: bugs, dfodor, fjanuska
Target Milestone: ovirt-4.4.8
Keywords: Reopened
Target Release: ---
Flags: pm-rhel: ovirt-4.4+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.4.8
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-19 06:23:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Arik 2021-06-23 09:49:33 UTC
From time to time exporting a VM to an OVA fails.
It seems that something happens right after the packing script (pack_ova.py) is invoked that makes the engine think the operation failed.

It doesn't seem to be a problem with the host though, since an attempt to export a template to OVA on that host succeeds right after.

Comment 4 Arik 2021-06-23 10:05:51 UTC
Filip, can you please also check why OST didn't fail on that and went on to import that OVA?

Comment 5 Filip Januška 2021-06-23 11:30:18 UTC
The way OST checks the result of this export doesn't seem very reliable. During the export, a temporary VM snapshot is created and then deleted once the export is complete. OST only checks that this snapshot is no longer present, which holds whether the export fails or succeeds. Perhaps we should check for the actual .ova file on the host?
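One way the proposed check could look, purely as an illustration; the host address, the OVA path and the use of ssh here are assumptions, not OST code:

import subprocess

def ova_exists_on_host(host_address, ova_path):
    # Check over ssh that the exported OVA is present and non-empty
    # ('test -s' fails on a missing or empty file).
    result = subprocess.run(['ssh', 'root@%s' % host_address, 'test', '-s', ova_path])
    return result.returncode == 0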

Comment 6 Arik 2021-06-23 12:23:52 UTC
(In reply to Filip Januška from comment #5)
> The way OST checks the result of this export doesn't seem very reliable.
> During the export, a temporary VM snapshot is created and then deleted once
> the export is complete. OST only checks that this snapshot is no longer
> present, which holds whether the export fails or succeeds. Perhaps we
> should check for the actual .ova file on the host?

Ah yes, that takes me back to the time it was added...
The rationale was not to make any "heavy" call but just to check whether the snapshot is still there, in order to determine whether the export command is still executed.
Then, when we know the export command (which is executed asynchronously) has completed, we can check whether it succeeded or not.
I wouldn't change the way we check whether the export command has completed, and I wouldn't add another check for the existence of the OVA file (because that's what the import command does at the beginning), but I think we should check whether the export command succeeded or not via its execution job.
If we're able to say "export command ended with failure", it would be less confusing (initially I suspected that another job removed the OVA before we got to the import phase).
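A minimal sketch of how such a check could look in OST, assuming the test already holds an ovirtsdk4 Connection; matching the export job by its description and the polling interval below are illustrative assumptions, not the engine's actual job wording:

import time
import ovirtsdk4.types as types

def wait_for_ova_export_job(connection, vm_name, timeout=600):
    jobs_service = connection.system_service().jobs_service()
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Find the export job by its description; the exact wording of the
        # engine's job description is an assumption here.
        export_jobs = [j for j in jobs_service.list()
                       if vm_name in (j.description or '') and 'OVA' in (j.description or '')]
        if export_jobs and all(j.status != types.JobStatus.STARTED for j in export_jobs):
            if any(j.status == types.JobStatus.FAILED for j in export_jobs):
                raise RuntimeError('export command ended with failure for %s' % vm_name)
            return
        time.sleep(5)
    raise RuntimeError('timed out waiting for the OVA export job of %s' % vm_name)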

Comment 7 Arik 2021-06-23 12:24:50 UTC
still executed -> still executing

Comment 10 Arik 2021-07-04 13:35:34 UTC
This one is very difficult to reproduce; only regression testing by automation is required.

Comment 11 Qin Yuan 2021-07-07 08:24:03 UTC
Verified with:
ovirt-engine-4.4.7.6-0.11.el8ev.noarch

No regression was found in the automation tests regarding exporting a VM to OVA.

Comment 12 Sandro Bonazzola 2021-07-08 14:15:34 UTC
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 13 Arik 2021-07-08 20:16:46 UTC
Happened again with the fix:
converting disk: /rhev/data-center/mnt/192.168.202.2:_exports_nfs_share2/49b69e88-f7a9-4daa-8add-69c07629c465/images/699a4e21-8eb5-4d54-a21b-fa1bb877bca9/7b1674d2-185e-4c2a-aeb7-1ceb4cc1c8d6, offset 19968
losetup: /var/tmp/ova_vm.ova.tmp: failed to set up loop device: Resource temporarily unavailable

And in 'messages':
Jul  7 11:31:06 lago-basic-suite-master-host-0 kernel: loop: module loaded
Jul  7 11:31:06 lago-basic-suite-master-host-0 kernel: loop0: detected capacity change from 0 to 414208
Jul  7 11:31:06 lago-basic-suite-master-host-0 kernel: loop_set_status: loop0 () has still dirty pages (nrpages=1)

Seems that others have faced this issue as well:
https://www.spinics.net/lists/kernel/msg3975499.html

Their proposed fix:
https://www.spinics.net/lists/kernel/msg3977449.html

If we can't ensure that there are no dirty pages at the time we set up the loopback device, we can take the same approach in pack_ova.py (since create-ova needs to work on clusters that won't get the fix [1]): identify whether the error is "Resource temporarily unavailable" (which is EAGAIN) and retry up to 64 times. We need to think of how to avoid iterating 64 times when losetup would also iterate 64 times internally, though.

[1] https://github.com/karelzak/util-linux/commit/3e03cb680668e4d47286bc7e6ab43e47bb84c989
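A rough sketch of the suggested retry, not the actual pack_ova.py code; the losetup flags reflect a plain loop-device setup, and the path, offset, retry count and delay are illustrative assumptions:

import subprocess
import time

def attach_loop_device(ova_path, offset, size, retries=10, delay=1.0):
    command = ['losetup', '--find', '--show',
               '--offset', str(offset), '--sizelimit', str(size), ova_path]
    for attempt in range(retries):
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, universal_newlines=True)
        if result.returncode == 0:
            # losetup --show prints the allocated device, e.g. /dev/loop0
            return result.stdout.strip()
        if 'Resource temporarily unavailable' not in result.stderr:
            raise RuntimeError(result.stderr)
        # EAGAIN: the kernel may still see dirty pages for the file; wait and retry
        time.sleep(delay)
    raise RuntimeError('losetup kept failing with EAGAIN after %d attempts' % retries)

Sleeping between attempts also spaces out the external retries instead of stacking them on top of losetup's internal 64 iterations.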

Comment 15 Arik 2021-07-08 21:47:00 UTC
(In reply to Arik from comment #13)
> If we can't ensure that there are no dirty pages at the time we set up the
> loopback device, we can take the same approach in pack_ova.py (since
> create-ova needs to work on clusters that won't get the fix [1]): identify
> whether the error is "Resource temporarily unavailable" (which is EAGAIN)
> and retry up to 64 times. We need to think of how to avoid iterating 64
> times when losetup would also iterate 64 times internally, though.

Of course it doesn't have to be 64 times; we can check every second or every few seconds, several times.

Comment 16 Qin Yuan 2021-08-13 05:05:17 UTC
Verified with:
ovirt-engine-4.4.8.3-0.10.el8ev.noarch

No regression issues were found.

Comment 17 Sandro Bonazzola 2021-08-19 06:23:13 UTC
This bugzilla is included in oVirt 4.4.8 release, published on August 19th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.8 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.