Description of problem:

As part of our automation TC27605, we try to download 4 snapshot disks in parallel using the Python SDK example script download_disk_snapshot.py. Sometimes, one of the download operations fails with the following error:

Starting multi threaded snapshot disks downloads
[10.35.232.28] Executing command python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py iscsi_1 811727c0-8925-41ea-8d9f-0e7331546ce6 /root/download/811727c0-8925-41ea-8d9f-0e7331546ce6 -c engine
[10.35.232.28] Executing command python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py iscsi_1 dc80b2c9-4866-4e93-8f90-b670eb58c7c1 /root/download/dc80b2c9-4866-4e93-8f90-b670eb58c7c1 -c engine
[10.35.232.28] Executing command python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py iscsi_1 1679debf-1e65-4434-9342-c85f9452b6a4 /root/download/1679debf-1e65-4434-9342-c85f9452b6a4 -c engine
[10.35.232.28] Executing command python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py iscsi_1 ac7b998d-c0e7-4bb2-a012-f596584c8065 /root/download/ac7b998d-c0e7-4bb2-a012-f596584c8065 -c engine
[10.35.232.28] Failed to run command ['python3', '/usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py', 'iscsi_1', 'dc80b2c9-4866-4e93-8f90-b670eb58c7c1', '/root/download/dc80b2c9-4866-4e93-8f90-b670eb58c7c1', '-c', 'engine']
ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py", line 165, in <module>
    **extra_args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 186, in download
    name="download")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 193, in __exit__
    self.stop()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 170, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 235, in _run
    handler = self._handler_factory()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 262, in __init__
    self._src = src_factory()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 82, in clone
    con = self._clone_connection()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 420, in _clone_connection
    return self._create_unix_connection(self.server_address)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 392, in _create_unix_connection
    con.connect()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 627, in connect
    self.sock.connect(self.path)
BlockingIOError: [Errno 11] Resource temporarily unavailable

We start 4 downloads, and each download uses 4 connections, so we have 16 concurrent connections. Looks like bug 1925345.

Version-Release number of selected component (if applicable):
Seen for a while now in both 4.4 and 4.5. After a deep investigation, we decided to open a bug.

How reproducible:
Rarely. Once in a while it is reproduced in our regression runs.

Steps to Reproduce:
1. Clone a VM from a template with disk permutations including FS on a block SD: virtio_scsicow, virtio_scsiraw, virtioraw, virtiocow
2. Take a snapshot of the VM
3. Using the SDK example script, download the snapshot disks in parallel on the SPM host (see the sketch under Additional info below):
   a. TC1: with finalization of the transfer
   b. TC2: without finalization

Actual results:
Sometimes, one of the download operations fails with 'BlockingIOError: [Errno 11] Resource temporarily unavailable'.

Expected results:
All download operations should succeed.

Additional info:
version: ovirt-imageio-client-2.3.0-1.el8ev.x86_64
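For reference, a minimal sketch of how the parallel downloads can be driven from Python. This is not the automation code itself; the script path, storage domain name and disk IDs are taken from the log above, and the thread pool is just an assumption about how the commands are run concurrently.

import subprocess
from concurrent.futures import ThreadPoolExecutor

SCRIPT = "/usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk_snapshot.py"
DISK_IDS = [
    "811727c0-8925-41ea-8d9f-0e7331546ce6",
    "dc80b2c9-4866-4e93-8f90-b670eb58c7c1",
    "1679debf-1e65-4434-9342-c85f9452b6a4",
    "ac7b998d-c0e7-4bb2-a012-f596584c8065",
]

def download(disk_id):
    # Each script invocation opens 4 connections to the imageio server,
    # so 4 parallel downloads give 16 concurrent connections.
    cmd = ["python3", SCRIPT, "iscsi_1", disk_id,
           "/root/download/{}".format(disk_id), "-c", "engine"]
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Run all 4 downloads at the same time and report their exit codes.
with ThreadPoolExecutor(max_workers=len(DISK_IDS)) as pool:
    for res in pool.map(download, DISK_IDS):
        print(res.args, "rc={}".format(res.returncode))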
Evelina, can you confirm that this build solves the issue?
https://github.com/oVirt/ovirt-imageio/actions/runs/2013146111

To test, you can download this zip file:
https://github.com/oVirt/ovirt-imageio/suites/5730182938/artifacts/189701716

mkdir test
cd test
unzip ../rpm-centos-8.zip
dnf upgrade *.rpm

Running the automated tests that start 4 concurrent downloads should not fail; maybe run them 10 times to be sure.

With the fix you should be able to download 10 disks in parallel without any error. This uses 40 concurrent connections to the imageio server.
Proposing for 4.5.0 since this is a trivial fix and easy to test. It is unlikely to happen in real usage, so we can also deliver this in 4.5.1.
(In reply to Nir Soffer from comment #3)
> Proposing for 4.5.0 since this is a trivial fix and easy to test.
>
> It is unlikely to happen in real usage, so we can also deliver this
> in 4.5.1.

As, according to Evelina, it affects our automation and the fix is trivial, let's aim for 4.5.0.
Fix merged, will be available in ovirt-imageio 2.4.2-1.
(In reply to Nir Soffer from comment #2)
> Evelina, can you confirm that this build solves the issue?
> https://github.com/oVirt/ovirt-imageio/actions/runs/2013146111
>
> To test, you can download this zip file:
> https://github.com/oVirt/ovirt-imageio/suites/5730182938/artifacts/189701716
>
> mkdir test
> cd test
> unzip ../rpm-centos-8.zip
> dnf upgrade *.rpm
>
> Running the automated tests that start 4 concurrent downloads should not
> fail; maybe run them 10 times to be sure.
>
> With the fix you should be able to download 10 disks in parallel without
> any error. This uses 40 concurrent connections to the imageio server.

Will be tested as part of our automation runs, as this one is hard to reproduce.
The issue was inheriting the default listen backlog size (5) from the Python standard library. This never caused a problem before, since nobody tried to start more than 5 transfers at the same time. The QE test starts 4 transfers at the same time, each of them opening 4 connections at the same time, so it can fail randomly with the default listen backlog.

The listen backlog was changed to 40, allowing up to 10 transfers to be started at the same time using the default 4 connections per transfer.

We have a new automated test starting 10 transfers at the same time, so this is unlikely to break again:
https://github.com/oVirt/ovirt-imageio/blob/385e9e460b7487569adf07a00d3405aba71e46d8/test/client_test.py#L1071

Running the current automated tests will be enough to verify this change. The tests that used to fail randomly should not fail now.
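To illustrate the failure mode outside imageio, here is a minimal sketch with a plain UNIX socket: a listener with a small backlog that does not accept() yet, and more non-blocking clients than the backlog can queue. The socket path, backlog value, timeout and connection counts are illustrative, not taken from the imageio code.

import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(5)          # small backlog, similar to the old default

clients = []
try:
    for i in range(16):   # e.g. 4 transfers x 4 connections each
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        # A timeout makes the socket non-blocking, so a full backlog
        # surfaces immediately as EAGAIN instead of waiting.
        c.settimeout(10)
        c.connect(path)   # raises BlockingIOError once the backlog is full
        clients.append(c)
        print("connection {} queued".format(i + 1))
except BlockingIOError as e:
    print("connection {} failed: {}".format(i + 1, e))
finally:
    for c in clients:
        c.close()
    server.close()

With a backlog of 40, all of these pending connections are queued until the server accepts them, which is why 10 parallel transfers with 4 connections each should not fail after the fix.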
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
Verified. All download operations succeeded.

Versions:
ovirt-engine-4.5.0.2-0.7.el8ev
ovirt-imageio-2.4.3-1