Description of problem:

pack_ova.py sets up a loopback device to build the OVA, writing the disk image directly at an offset into the file:
https://github.com/oVirt/ovirt-engine/blob/master/packaging/ansible-runner-service-project/project/roles/ovirt-ova-pack/files/pack_ova.py#L52

The relevant part of convert_disks():

    def convert_disks(ova_path):
        for path, offset in six.iteritems(path_to_offset):
            print("converting disk: %s, offset %s" % (path, offset))
            output = check_output(['losetup', '--find', '--show',
                                   '-o', offset, ova_path])
            loop = from_bytes(output.splitlines()[0])
            loop_stat = os.stat(loop)
            call(['udevadm', 'settle'])
            vdsm_user = pwd.getpwnam('vdsm')
            os.chown(loop, vdsm_user.pw_uid, vdsm_user.pw_gid)
            try:
                qemu_cmd = ("qemu-img convert -p -T none -O qcow2 '%s' '%s'"
                            % (path, loop))
                check_call(['su', '-p', '-c', qemu_cmd, 'vdsm'])

This offset can cause unaligned writes, greatly reducing performance when the backing file is on NFS served by another host. See:

    # Write to loopback device backed by a file on NFS, no offset
    $ losetup --find --show /ova/loop_test.qcow2
    $ time qemu-img convert -p -T none -O qcow2 /dev/640bd68d-bfde-45e6-9333-71316fc46893/210f2a85-55dd-4217-8889-39c51d3ef89e /dev/loop0
        (100.00/100%)

    real    1m31.108s
    user    0m5.053s
    sys     0m20.775s

    # Write to loopback device backed by a file on NFS, with offset
    $ losetup --find --show -o 14842 /ova/loop_test.qcow2
    $ time qemu-img convert -p -T none -O qcow2 /dev/640bd68d-bfde-45e6-9333-71316fc46893/210f2a85-55dd-4217-8889-39c51d3ef89e /dev/loop0
        (100.00/100%)

    real    18m18.530s
    user    0m4.254s
    sys     0m46.925s

The NFS share in this case was mounted with:
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none

Version-Release number of selected component (if applicable):
* RHV 4.3.10 + RHEL 7.8 - 3.10.0-1127.el7.x86_64 (test above)
* RHV 4.4.4 + RHEL 8.3 - 4.18.0-240.10.1.el8_3.x86_64 (test below)

On the latest RHVH 4.4.4 with RHEL 8.3 as the NFS server, the same pattern appears:

    host2.kvm:/exports/nfs on /mnt/nfs type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.100.1,local_lock=none,addr=192.168.100.2)

    # losetup --find --show -o 14842 /mnt/nfs/test_offset
    /dev/loop0
    # losetup --find --show /mnt/nfs/test
    /dev/loop1

    # time qemu-img convert -O qcow2 -T none /dev/test/lv1 /dev/loop0

    real    0m3.852s
    user    0m0.107s
    sys     0m1.265s

    # time qemu-img convert -O qcow2 -T none /dev/test/lv1 /dev/loop1

    real    0m1.917s
    user    0m0.135s
    sys     0m0.477s

How reproducible:
Always, but on a RHEL 8.3 host the slowdown is not as pronounced as on a RHEL 7.8 host.

Steps to Reproduce:
1. Create loop devices backed by files on NFS:
   $ truncate -s 2G /mnt/test0
   $ truncate -s 2G /mnt/test1
   $ losetup --find --show -o 14842 /mnt/test0
   $ losetup --find --show /mnt/test1
2. Convert:
   $ qemu-img convert -p -T none -O qcow2 /dev/test/lv1 /dev/loop0
   $ qemu-img convert -p -T none -O qcow2 /dev/test/lv1 /dev/loop1

Actual results:
* Much slower OVA export over NFS.
* Fails if the engine ansible timeout is not tuned.

Additional info:
* Over a local disk this does not seem to have a big impact.
* No matter what storage backs the NFS export, the offset case is slower, even when the NFS server is backed by tmpfs.
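To make the alignment issue concrete, here is a minimal sketch, assuming a 4096-byte block/page size on the server (the is_aligned helper is ours, not part of pack_ova.py). An offset such as 14842 is not a multiple of the block size, so every block-sized write issued through the loop device straddles two blocks of the backing file, forcing the NFS client into read-modify-write cycles:

    BLOCK_SIZE = 4096  # assumed server-side block/page size

    def is_aligned(offset, block_size=BLOCK_SIZE):
        # An offset that is not a multiple of the block size shifts every
        # write so that it straddles two blocks of the backing file.
        return offset % block_size == 0

    print(is_aligned(14842))  # False: 14842 % 4096 == 2554
    print(is_aligned(16384))  # True: 16384 == 4 * 4096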
The important part that I tested is the timeout:

  Actual results:
  * Much slower OVA export over NFS.
  * Fails if the engine ansible timeout is not tuned.
  (from comment #0)

This did not fail for me in ~4 hours of running, while pack_ova.py ran with:

    while True:
        sleep(1000)

The engine kept the command running and failed only after I killed the process on the host. This is because of the move to run the long export-OVA parts asynchronously within the engine.
(In reply to Liran Rotenberg from comment #3)
> This is because of the move to run the long export-OVA parts asynchronously
> within the engine.

Yeah, that's an important thing to note - we significantly changed the way that ansible script is executed in 4.4.4, so it makes sense that the export-to-OVA task no longer times out.

That said, we can align the offset in order to improve the time it takes to export the OVA to NFS.
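A minimal sketch of that idea, assuming pack_ova.py may write zero padding between the OVF metadata and the disk data (the pad_to_alignment helper and the ALIGNMENT constant are illustrative, not the actual ovirt-engine patch):

    ALIGNMENT = 4096  # assumption: align each disk's start to the page size

    def pad_to_alignment(ova_file, offset, alignment=ALIGNMENT):
        """Zero-pad the OVA so the next disk starts on an aligned offset."""
        padding = (alignment - offset % alignment) % alignment
        if padding:
            ova_file.write(b'\0' * padding)
        return offset + padding

losetup would then be called with the padded value, e.g. check_output(['losetup', '--find', '--show', '-o', str(aligned), ova_path]), so that the writes issued by qemu-img land on block boundaries of the backing file.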
Need to consider forward compatibility
Although we implemented the change, we don't notice a difference in QE environments. We can't say that it's fixed, but on the other hand it may be improved in some scenarios / on some hardware. Therefore failing this bug for now and re-targeting it to 4.5.1; we'll try to investigate this a bit further by then.
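One quick thing worth checking while an export runs on a QE host is the actual offset handed to losetup; these are standard util-linux and Python calls, with 14842 being the unaligned offset from comment #0:

    import subprocess

    # List attached loop devices with their byte offsets and backing files;
    # an OFFSET that is a multiple of 4096 means the export path is aligned.
    out = subprocess.check_output(['losetup', '-l', '-O', 'NAME,OFFSET,BACK-FILE'])
    print(out.decode())

    print(14842 % 4096)  # -> 2554, the old unaligned remainder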
Verified on:
ovirt-engine-4.5.0.6-0.7.el8ev
vdsm-4.50.0.13-1.el8ev.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d.x86_64
libvirt-daemon-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
ansible-runner-2.1.3-1.el8ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711