Description of problem:

When importing a Raw-formatted VM from a KVM source into a SCSI storage domain, the process fails to complete when calling the ovirt-imageio progress routine within vdsm's kvm2ovirt module. Because the allocation type for block storage is set to Preallocated, download_disk is called as per design, but the function never returns (it ran overnight).

Attempting to cancel the conversion via the Webadmin causes an additional failure reported in the vdsm log as follows, instead of returning gracefully:

2020-06-25 11:39:36,527+0000 ERROR (v2v/084a9b0c) [storage.TaskManager.Task] (Task='3c340a19-ef54-47f9-9a20-96e78fcfb7ed') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/v2v.py", line 861, in _run
    self._import()
  File "/usr/lib/python3.6/site-packages/vdsm/v2v.py", line 880, in _import
    self._watch_process_output()
  File "/usr/lib/python3.6/site-packages/vdsm/v2v.py", line 905, in _watch_process_output
    for event in parser.parse(out):
  File "/usr/lib/python3.6/site-packages/vdsm/v2v.py", line 962, in parse
    for chunk in self._iter_progress(stream):
  File "/usr/lib/python3.6/site-packages/vdsm/v2v.py", line 981, in _iter_progress
    raise OutputParserError('copy-disk stream closed unexpectedly')
vdsm.v2v.OutputParserError: copy-disk stream closed unexpectedly

Version-Release number of selected component (if applicable):
vdsm.x86_64 4.40.16-1.el8 @System
ovirt-imageio-common.x86_64 2.0.6-0.el8 @System
ovirt-imageio-daemon.x86_64 2.0.6-0.el8 @System

How reproducible:

Steps to Reproduce:
1. Create a SCSI storage domain.
2. Import a Raw-format VM from a KVM source.
3.
Actual results:
Though the Allocation Type is set in the oVirt engine to Preallocated when block storage is detected, the process hangs in the kvm2ovirt copy loop:

    with progress(op, estimated_size):
        op.run()

Expected results:
The image copy completes and control returns to the kvm2ovirt process, so that the import completes successfully.

Additional info:
Created attachment 1698782 [details] Script used to generate the scsi storage domain
ovirt-imageio is not used in this flow. Moving to vdsm, since the code doing the import is part of vdsm. Changing the group to virt, since the storage team does not maintain the kvm2ovirt.py script.
Created attachment 1698788 [details]
Relevant Import logs

Relevant import logs, including the kvm2ovirt log, where kvm2ovirt has been modified for additional traces:

    def download_disk(adapter, estimated_size, size, dest, bufsize):
        write_output('Calls direct io')
        op = directio.Receive(dest, adapter, size=size, buffersize=bufsize)
        with progress(op, estimated_size):
            op.run()
        adapter.finish()
        write_output('finished')
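For readers unfamiliar with this pattern, here is a minimal, self-contained sketch of a progress context manager wrapping a blocking copy operation. This is illustrative only: CopyOp, the chunk size, and the sampling thread are invented for this example and are not vdsm's actual directio.Receive or progress implementation.

```python
# Minimal sketch of the progress/copy pattern used in kvm2ovirt's
# download_disk(). CopyOp is a toy stand-in for directio.Receive; the
# real vdsm classes and reporting format differ.
import threading
from contextlib import contextmanager


class CopyOp:
    """Toy copy operation tracking how many bytes were written."""

    def __init__(self, data, chunk=4):
        self._data = data
        self._chunk = chunk
        self.size = len(data)
        self.done = 0
        self.out = bytearray()

    def run(self):
        # Blocks until the whole payload is copied; this bug report is
        # about the real op.run() never returning on flaky block storage.
        for i in range(0, self.size, self._chunk):
            self.out += self._data[i:i + self._chunk]
            self.done = min(i + self._chunk, self.size)


@contextmanager
def progress(op, estimated_size):
    """Sample op.done from a helper thread while the copy runs."""
    reports = []
    stop = threading.Event()

    def sample():
        while not stop.wait(0.001):
            reports.append(100 * op.done // estimated_size)

    t = threading.Thread(target=sample)
    t.start()
    try:
        yield reports
    finally:
        stop.set()
        t.join()
        reports.append(100 * op.done // estimated_size)  # final progress

op = CopyOp(b"x" * 100)
with progress(op, estimated_size=100) as reports:
    op.run()

assert op.out == b"x" * 100
assert reports[-1] == 100
```

The sketch only shows why the copy's completion and progress reporting are coupled: if op.run() hangs, the with block never exits and no completion is ever reported.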
This is not an external integration but part of the project, hence it is inherently supported. This is not the first time issues with this internal interface have come up, and one of the reasons is that it lacks integration tests. Please cover the desired use case in OST.
Steven, to save time debugging your iscsi server, please rebuild your server using these scripts:
https://github.com/oVirt/vdsm/tree/master/contrib/target-tools

$ contrib/target-tools/create-target -h
Usage: make-target [options] NAME

Options:
  -h, --help            show this help message and exit
  -s LUN_SIZE, --lun-size=LUN_SIZE
                        LUN size in GiB (default 100)
  -n LUN_COUNT, --lun-count=LUN_COUNT
                        Number of LUNs (default 10)
  -r ROOT_DIR, --root-dir=ROOT_DIR
                        root directory (default /target)
  -i IQN_BASE, --iqn-base=IQN_BASE
                        iqn base name (default iqn.2003-01.org)
Steven, looking in your daemon log, it seems that you are not using the latest upload_disk.py. Please install the latest version from:

- fedora: https://jenkins.ovirt.org/job/ovirt-engine-sdk_standard-on-merge/77/artifact/build-artifacts.fc30.x86_64/
- centos/rhel: https://jenkins.ovirt.org/job/ovirt-engine-sdk_standard-on-merge/77/artifact/build-artifacts.el8.x86_64/

Looking in your daemon log:

1. A ticket was added

2020-06-30 19:56:32,204 INFO (Thread-4) [tickets] [local] ADD ticket={'dirty': False, 'ops': ['write'], 'size': 10737418240, 'sparse': False, 'transfer_id': '901212bc-9c9e-4613-83a0-b3ce4256538b', 'uuid': '2beaa466-7cc1-4acb-8baa-ee763deeffd8', 'timeout': 300, 'url': 'nbd:unix:/var/run/vdsm/nbd/2beaa466-7cc1-4acb-8baa-ee763deeffd8.sock'}

2. Client from ::ffff:10.46.10.158 started a transfer

2020-06-30 19:56:32,206 INFO (Thread-4) [http] CLOSE client=local [connection 1 ops, 0.003885 s] [dispatch 1 ops, 0.002944 s]
2020-06-30 19:56:33,653 INFO (Thread-5) [http] OPEN client=::ffff:10.46.10.158
2020-06-30 19:56:33,654 INFO (Thread-5) [images] [::ffff:10.46.10.158] OPTIONS ticket=2beaa466-7cc1-4acb-8baa-ee763deeffd8
2020-06-30 19:56:33,655 INFO (Thread-5) [backends.nbd] Open backend address='/var/run/vdsm/nbd/2beaa466-7cc1-4acb-8baa-ee763deeffd8.sock' export_name='' max_connections=8

But we don't have any more logs for this thread until the end of the log:

2020-06-30 19:58:56,905 INFO (Thread-23) [http] CLOSE client=local [connection 1 ops, 0.000720 s] [dispatch 1 ops, 0.000201 s]

So it looks like the ovirt-imageio daemon is blocked accessing qemu-nbd, and qemu-nbd is blocked on storage. When this happens, the request will time out on the client side.

When the client times out, it should close the connection, and we should see a log for this. Maybe you did not send the entire log?

Looking in /var/log/messages, we see:

1.
qemu-nbd service was started

xport-name= --cache=none --aio=native /rhev/data-center/mnt/blockSD/da538992-3aaa-4d59-aa28-00966f88f0dd/images/1f10724e-96a0-4548-a861-97477bb039f2/8ed2e502-a3f3-4489-89b6-9c414c7b3c62.
Jun 30 19:56:32 vm-10-157 kernel: device-mapper: core: qemu-nbd: sending ioctl 5326 to DM device without required privilege.

2. Writing to sdb failed

Jun 30 19:57:02 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#59 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 30 19:57:02 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#59 Sense Key : Not Ready [current]
Jun 30 19:57:02 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#59 Add. Sense: Logical unit communication failure
Jun 30 19:57:02 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#59 CDB: Write(10) 2a 00 01 12 00 00 00 20 00 00
Jun 30 19:57:02 vm-10-157 kernel: print_req_error: I/O error, dev sdb, sector 17956864 flags ca01
Jun 30 19:57:02 vm-10-157 kernel: device-mapper: multipath: Failing path 8:16.

3. Multipath fails the path

Jun 30 19:57:02 vm-10-157 multipathd[6527]: sdb: mark as failed
Jun 30 19:57:02 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Entering recovery mode: max_retries=4
Jun 30 19:57:02 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 0

4. 5 seconds later, the path checker reports that the path is ok, and multipath reinstates the path

Jun 30 19:57:07 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: sdb - tur checker reports path is up
Jun 30 19:57:07 vm-10-157 multipathd[6527]: 8:16: reinstated
Jun 30 19:57:07 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: queue_if_no_path enabled
Jun 30 19:57:07 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Recovered to normal mode
Jun 30 19:57:07 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 1
Jun 30 19:57:07 vm-10-157 kernel: device-mapper: multipath: Reinstating path 8:16.
Jun 30 19:57:07 vm-10-157 kernel: sd 3:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA

6. Writing to the device fails again...

Jun 30 19:57:07 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#63 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 30 19:57:07 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#63 Sense Key : Not Ready [current]
Jun 30 19:57:07 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#63 Add. Sense: Logical unit communication failure
Jun 30 19:57:07 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#63 CDB: Write(10) 2a 00 01 12 00 00 00 20 00 00
Jun 30 19:57:07 vm-10-157 kernel: print_req_error: I/O error, dev sdb, sector 17956864 flags ca01
Jun 30 19:57:07 vm-10-157 kernel: device-mapper: multipath: Failing path 8:16.

7. Multipath fails the path again

Jun 30 19:57:08 vm-10-157 multipathd[6527]: sdb: mark as failed
Jun 30 19:57:08 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Entering recovery mode: max_retries=4
Jun 30 19:57:08 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 0

8. 4 seconds later, the path checker reports that the path is ok, and multipath reinstates the path

Jun 30 19:57:12 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: sdb - tur checker reports path is up
Jun 30 19:57:12 vm-10-157 multipathd[6527]: 8:16: reinstated
Jun 30 19:57:12 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: queue_if_no_path enabled
Jun 30 19:57:12 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Recovered to normal mode
Jun 30 19:57:12 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 1
Jun 30 19:57:12 vm-10-157 kernel: device-mapper: multipath: Reinstating path 8:16.
Jun 30 19:57:12 vm-10-157 kernel: sd 3:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
Jun 30 19:57:12 vm-10-157 lvm[486]: pvscan[486] PV /dev/mapper/3600140562e0fe6542ec4cdfb40ffb605 online, VG da538992-3aaa-4d59-aa28-00966f88f0dd is complete.
Jun 30 19:57:12 vm-10-157 lvm[486]: pvscan[486] VG da538992-3aaa-4d59-aa28-00966f88f0dd run autoactivation.

9. This dance repeats until the end of the log, practically keeping the I/O queued forever.

Jun 30 19:59:43 vm-10-157 multipathd[6527]: sdb: mark as failed
Jun 30 19:59:43 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Entering recovery mode: max_retries=4
Jun 30 19:59:43 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 0
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: sdb - tur checker reports path is up
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 8:16: reinstated
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: queue_if_no_path enabled
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Recovered to normal mode
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 1
Jun 30 19:59:47 vm-10-157 kernel: device-mapper: multipath: Reinstating path 8:16.
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 Sense Key : Not Ready [current]
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 Add. Sense: Logical unit communication failure
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 CDB: Write(10) 2a 00 01 12 00 00 00 20 00 00
Jun 30 19:59:47 vm-10-157 kernel: print_req_error: I/O error, dev sdb, sector 17956864 flags ca01
Jun 30 19:59:47 vm-10-157 kernel: device-mapper: multipath: Failing path 8:16.
Jun 30 19:59:48 vm-10-157 multipathd[6527]: sdb: mark as failed
Jun 30 19:59:48 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Entering recovery mode: max_retries=4
Jun 30 19:59:48 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 0

Most likely your storage is misconfigured; this is not the first time this has happened. Please see comment 18 for how to set up a working server.

This may also show a multipath bug. We expect multipath to retry 4 times and fail the I/O after that, but it looks like it cannot handle this case.
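To make the suspected failure mode concrete, here is a toy model of the fail/reinstate loop described above. It is not multipathd's actual algorithm; in particular, the assumption that reinstating a path resets the recovery-mode retry budget is drawn from the log ("Entering recovery mode: max_retries=4" followed by "Recovered to normal mode" on every cycle), not verified against the multipath-tools source.

```python
# Toy model of the fail/reinstate "dance" from the log above. The key
# assumption (hypothesis, not verified in multipath-tools code): each
# reinstate resets the recovery-mode retry budget, so a path that flaps
# faster than max_retries can be exhausted keeps I/O queued forever.

def multipath_io(max_retries, path_events):
    """path_events: True = checker reinstates the path, False = I/O fails."""
    retries_left = max_retries
    for path_up in path_events:
        if path_up:
            retries_left = max_retries  # "Recovered to normal mode"
        else:
            retries_left -= 1           # "Entering recovery mode"
            if retries_left < 0:
                return "io_failed"      # what we expect after max_retries
    return "io_queued"                  # queue_if_no_path keeps I/O pending


# A flapping path never exhausts the budget: I/O stays queued forever.
assert multipath_io(4, [False, True] * 10) == "io_queued"
# A path that stays down exhausts max_retries and the I/O fails.
assert multipath_io(4, [False] * 6) == "io_failed"
```

Under this model, the observed behavior is consistent: the tur checker keeps reporting the path up between write failures, so the retry budget never runs out and the import I/O blocks indefinitely.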
Ben, comment 21 shows an unfortunate case where multipath is in an endless loop, failing and reinstating a path, practically queuing I/O forever.

You can check the host /var/log/messages and storage server /var/log/messages in attachment 1699354 [details].

I have no more info on this environment, but Steven may be able to give more info or provide access to the environment for debugging this.

Can you take a look?
(In reply to Nir Soffer from comment #21)
> Steven, looking in your daemon log, it seems that you are not using latest
> upload_disk.py.
> [full daemon-log and multipath analysis quoted from comment 21; trimmed]
As per my comment [1], this can be simulated when the file is in RAW format and the upload does not specify sparse (so that the Allocation Type is preallocated). Please retest with these parameters.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1851023#c19
(In reply to Nir Soffer from comment #22)
> Ben, comment 21 show unfortunate case when multipath is in endelss
> loop, failing and reinstating a path, practically queuing I/O forever.
>
> You can check the host /var/log/messages and storage server
> /var/log/messages in attachment 1699354 [details].
>
> I have no more info on this environment, but Steven may be
> able to give more info or provide access to the environment
> for debugging this.
>
> Can you take a look?

As per my comment [1], this can be simulated when the file is in RAW format and the upload does not specify sparse (so that the Allocation Type is preallocated). Please retest with these parameters. Please advise if you need more information.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1851023#c19
(In reply to Steven Rosenberg from comment #23)
> As per my comment [1], this can be simulated when the file is a RAW file and
> the upload is to storage we are not specifying sparse (so that the
> Allocation Type is preallocated). Please retest with these parameters.

The type of the input file does not matter: upload_disk.py always reads raw data from qemu-nbd, regardless of the file format.

Please read comment 21 carefully, and test with a new storage server.
(In reply to Nir Soffer from comment #25)
> The type of the input file does not matter, uplod_disk.py is always reading
> raw data from qemu-nbd, regardless of the file format.
>
> Please read comment 21 carefully, and test with a new storage serer.

The script you referenced doesn't work well:

$ python ./create-target.py -n 1 -s 30 SCSI

Creating target
target_name: SCSI
target_iqn: iqn.2003-01.org.vm-11-32.SCSI
target_dir: /target/SCSI
lun_count: 1
lun_size: 30 GiB

Create target? [N/y]: y
Traceback (most recent call last):
  File "./create-target.py", line 66, in <module>
    reply = input("Create target? [N/y]: ")
  File "<string>", line 1, in <module>
NameError: name 'y' is not defined

Please advise on the options, and maybe we can just fix the other one through targetcli? Much of the previous script's setup seems to work. Or do you have another suggestion?
The script starts with:

    #!/usr/bin/python3

And it should run like this (from the vdsm source directory):

    $ contrib/target-tools/create-target

As shown clearly in comment 18.

Why did you run it with python 2? python 2 was sunset on January 1st, 2020:
https://www.python.org/doc/sunset-python-2/
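For reference, the traceback above is exactly what Python 2 produces here: Python 2's input() is effectively eval(raw_input()), so the typed reply "y" is evaluated as an expression and raises NameError, while Python 3's input() returns the raw string the script expects. A small sketch (the helper names are invented for illustration):

```python
# Why typing "y" crashed under Python 2: its input() evaluates the
# reply as an expression, equivalent to eval(raw_input()). Python 3's
# input() returns the typed text unchanged.

def py2_style_input(raw_reply):
    """Simulate Python 2 input(): evaluate the typed text."""
    return eval(raw_reply)


def py3_style_input(raw_reply):
    """Simulate Python 3 input(): return the typed text unchanged."""
    return raw_reply


assert py3_style_input("y") == "y"

try:
    py2_style_input("y")
except NameError:
    pass  # same NameError as in the traceback above: name 'y' is not defined
```

This is why running the script via its `#!/usr/bin/python3` shebang works, while invoking it with a bare `python` (Python 2 on CentOS 7) fails at the first prompt.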
(In reply to Nir Soffer from comment #27)
> The script starts with:
>
> #!/usr/bin/python3
>
> And it should run like this (from the vdsm source directory):
>
> $ contrib/target-tools/create-target
>
> As shown clearly in comment 18.
>
> Why did you run it with python 2? python 2 was sunset on January 1st, 2020:
> https://www.python.org/doc/sunset-python-2/

I am running it on CentOS 7.6. Do you recommend CentOS 8.1?
I'm using Fedora 31 for my servers. If you prefer CentOS, I would use the latest version (8.2.2004).
(In reply to Nir Soffer from comment #29)
> I'm using Fedora 31 for my servers. If you prefer CentOS, I would use
> the latest version (8.2.2004).

I ran the script; it fails:

$ python3 contrib/target-tools/create-target -n 1 -s 30 SCSI

Creating target
target_name: SCSI
target_iqn: iqn.2003-01.org.vm-11-32.SCSI
target_dir: /target/SCSI
lun_count: 1
lun_size: 30 GiB

Create target? [N/y]: y
Creating target directory '/target/SCSI'
Creating target 'iqn.2003-01.org.vm-11-32.SCSI'
Warning: Could not load preferences file /root/.targetcli/prefs.bin.
Created target iqn.2003-01.org.vm-11-32.scsi.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
Setting permissions (any host can access this target)
No such path /iscsi/iqn.2003-01.org.vm-11-32.SCSI
Traceback (most recent call last):
  File "contrib/target-tools/create-target", line 83, in <module>
    "cache_dynamic_acls=1"])
  File "/usr/lib64/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['targetcli', '/iscsi/iqn.2003-01.org.vm-11-32.SCSI/tpg1', 'set', 'attribute', 'authentication=0', 'demo_mode_write_protect=0', 'generate_node_acls=1', 'cache_dynamic_acls=1']' returned non-zero exit status 1.

What is supposed to create the path /iscsi/iqn.2003-01.org.vm-11-32.SCSI?
(In reply to Steven Rosenberg from comment #30)
> [script output quoted from comment 30; trimmed]
> No such path /iscsi/iqn.2003-01.org.vm-11-32.SCSI

I don't have this warning...

> subprocess.CalledProcessError: Command '['targetcli',
> '/iscsi/iqn.2003-01.org.vm-11-32.SCSI/tpg1', 'set', 'attribute',
> 'authentication=0', 'demo_mode_write_protect=0', 'generate_node_acls=1',
> 'cache_dynamic_acls=1']' returned non-zero exit status 1.

And targetcli does not fail.

> What is supposed to create the path: /iscsi/iqn.2003-01.org.vm-11-32.SCSI

I think this warning is incorrect and the issue is that /target is missing.

On my scsi server:

$ ls -lh /home/target/
total 8.0K
drwxr-xr-x. 2 root root 4.0K Dec 26  2019 1
drwxr-xr-x. 2 root root 4.0K Apr 19 22:58 2

$ sudo contrib/target-tools/create-target --lun-size 30 --lun-count 1 --root-dir /home/target test

Creating target
target_name: test
target_iqn: iqn.2003-01.org.sparse.test
target_dir: /home/target/test
lun_count: 1
lun_size: 30 GiB

Create target? [N/y]: y
Creating target directory '/home/target/test'
Creating target 'iqn.2003-01.org.sparse.test'
Created target iqn.2003-01.org.sparse.test.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
Setting permissions (any host can access this target)
Parameter authentication is now '0'.
Parameter demo_mode_write_protect is now '0'.
Parameter generate_node_acls is now '1'.
Parameter cache_dynamic_acls is now '1'.
Creating disks
Creating backing file '/home/target/test/00'
Creating backing store '/backstores/fileio/test-00'
Created fileio test-00 with size 32212254720
Parameter emulate_tpu is now '1'.
Parameter emulate_tpws is now '1'.
Parameter max_write_same_len is now '65335'.
Adding lun for '/backstores/fileio/test-00'
Created LUN 0.
Saving configuration
Last 10 configs saved in /etc/target/backup/.
Configuration saved to /etc/target/saveconfig.json
Target added successfully

$ targetcli ls /iscsi/iqn.2003-01.org.sparse.test
o- iqn.2003-01.org.sparse.test ... [TPGs: 1]
  o- tpg1 ... [gen-acls, no-auth]
    o- acls ... [ACLs: 0]
    o- luns ... [LUNs: 1]
    | o- lun0 ... [fileio/test-00 (/home/target/test/00) (default_tg_pt_gp)]
    o- portals ... [Portals: 1]
      o- 0.0.0.0:3260 ... [OK]

$ targetcli ls /backstores/fileio/test-00
o- test-00 ... [/home/target/test/00 (30.0GiB) write-back activated]
  o- alua ... [ALUA Groups: 1]
    o- default_tg_pt_gp ... [ALUA state: Active/optimized]

You can remove the target later with:

$ sudo contrib/target-tools/delete-target --lun-count 1 --root-dir /home/target test

Deleting target
target_name: test
target_iqn: iqn.2003-01.org.sparse.test
target_dir: /home/target/test
lun_count: 1

Delete target? [N/y]: y
Deleting disks
Deleting backing store '/backstores/fileio/test-00'
Deleted storage object test-00.
Deleting target 'iqn.2003-01.org.sparse.test'
Deleted Target iqn.2003-01.org.sparse.test.
Removing target directory '/home/target/test'
Saving configuration
Last 10 configs saved in /etc/target/backup/.
Configuration saved to /etc/target/saveconfig.json
Target deleted successfully
Tested also on CentOS 8.2, works correctly. On which system do you run this?

I tested with:

centos:
$ rpm -q targetcli
targetcli-2.1.51-4.el8_2.noarch

fedora:
$ rpm -q targetcli
targetcli-2.1.fb49-4.fc31.noarch
This:

Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: sdb - tur checker reports path is up
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 8:16: reinstated
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: queue_if_no_path enabled
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Recovered to normal mode
Jun 30 19:59:47 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 1
Jun 30 19:59:47 vm-10-157 kernel: device-mapper: multipath: Reinstating path 8:16.
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 Sense Key : Not Ready [current]
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 Add. Sense: Logical unit communication failure
Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 CDB: Write(10) 2a 00 01 12 00 00 00 20 00 00
Jun 30 19:59:47 vm-10-157 kernel: print_req_error: I/O error, dev sdb, sector 17956864 flags ca01
Jun 30 19:59:47 vm-10-157 kernel: device-mapper: multipath: Failing path 8:16.
Jun 30 19:59:48 vm-10-157 multipathd[6527]: sdb: mark as failed
Jun 30 19:59:48 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: Entering recovery mode: max_retries=4
Jun 30 19:59:48 vm-10-157 multipathd[6527]: 3600140562e0fe6542ec4cdfb40ffb605: remaining active paths: 0

looks weird to me. Clearly multipathd's tur checker passed, but looking at the kernel output from the I/O error immediately afterwards, this:

Jun 30 19:59:47 vm-10-157 kernel: sd 3:0:0:0: [sdb] tag#16 Sense Key : Not Ready [current]

makes it look like the path isn't in the Ready state, which *should* cause the tur checker to fail. Ewan, am I misreading the kernel output?

Is this reproducible?
If so, when this happens, can you run:

# sg_turs <ping_ponging_scsi_device> ; echo $?

and post the output? Either I'm misunderstanding the kernel output, or for some reason I don't understand SCSI reports Not Ready when I/O happens but not for a TUR command, or there's something wrong with the tur checker. Hopefully the sg_turs output will point to what's actually going on here.
It depends on what the underlying storage transport is. ASC/ASCQ 08h/00h LOGICAL UNIT COMMUNICATION FAILURE could be coming from QEMU, as there is a defined status code for that, and it is not common to see it coming from storage devices I am familiar with:

/* Command aborted, LUN Communication Failure */
const struct SCSISense sense_code_LUN_COMM_FAILURE = {
    .key = ABORTED_COMMAND, .asc = 0x08, .ascq = 0x00
};

However, the two places I can find that use this do so with CHECK_CONDITION, instead of NOT_READY:

    case MPATH_PR_OTHER:
    default:
        scsi_build_sense(sense, SENSE_CODE(LUN_COMM_FAILURE));
        return CHECK_CONDITION;
    }
}

out:
    if (ret < 0) {
        int sense_len = scsi_build_sense(io_hdr->sbp,
                                         SENSE_CODE(LUN_COMM_FAILURE));
        io_hdr->driver_status = SG_ERR_DRIVER_SENSE;
        io_hdr->sb_len_wr = MIN(io_hdr->mx_sb_len, sense_len);
        io_hdr->status = CHECK_CONDITION;
    }

It is possible that the storage stack is returning this NOT READY / LOGICAL UNIT COMMUNICATION FAILURE on data transfer operations like the WRITE(10) here, but not on TEST UNIT READY commands. Can the multipath configuration be changed to not use the TUR checker?
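To make the codes above concrete, here is a minimal lookup sketch (not QEMU or kernel code; the tables are trimmed to just the values discussed in this comment) that decodes the sense data from the kernel log:

```python
# Minimal sketch: decode the SCSI sense values discussed above.
# Only the entries relevant to this bug are included.
SENSE_KEYS = {
    0x2: "NOT READY",         # what the kernel log shows for the WRITE(10)
    0xB: "ABORTED COMMAND",   # the key QEMU's sense_code_LUN_COMM_FAILURE uses
}
ASC_ASCQ = {
    (0x08, 0x00): "LOGICAL UNIT COMMUNICATION FAILURE",
}

def describe_sense(key, asc, ascq):
    """Return human-readable (sense key, additional sense) strings."""
    return (SENSE_KEYS.get(key, "unknown"), ASC_ASCQ.get((asc, ascq), "unknown"))

# The combination reported in the kernel log:
logged = describe_sense(0x2, 0x08, 0x00)
# The combination QEMU's defined status code would generate:
qemu = describe_sense(0xB, 0x08, 0x00)
```

The mismatch between the two is exactly the oddity noted above: the same ASC/ASCQ pair, but different sense keys.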
(In reply to Ewan D. Milne from comment #34)
> It is possible that the storage stack is returning this NOT READY / LOGICAL
> UNIT COMMUNICATION FAILURE on data transfer operations like the WRITE(10)
> here, but not on TEST UNIT READY commands. Can the
> multipath configuration be changed to not use the TUR checker?

You can easily set it to use the directio checker. If you added:

path_checker "directio"

to the relevant devices config, or the overrides section, that would do it. But it would be nice to figure out why it's causing a problem in this setup.
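A sketch of the suggested change as an overrides stanza (assuming the standard /etc/multipath.conf layout; run `multipathd reconfigure` or restart multipathd after editing):

```text
# /etc/multipath.conf -- overrides section switching from the tur
# checker to directio, as suggested above
overrides {
    path_checker "directio"
}
```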
(In reply to Nir Soffer from comment #32)
> Tested also on centos 8.2, works correctly.
>
> On which system do you run this?
>
> I tested with:
>
> centos:
> $ rpm -q targetcli
> targetcli-2.1.51-4.el8_2.noarch
>
> fedora:
> $ rpm -q targetcli
> targetcli-2.1.fb49-4.fc31.noarch

I was able to make the SCSI installation work with centos 8.2. The earlier issues were with other OS distributions that the script does not support, including centos 8.1 and earlier.

This worked with the iqn.2003-01.org.vm-11-32.test path:

sudo contrib/target-tools/create-target --lun-size 30 --lun-count 1 --root-dir /home/target test
Creating target
  target_name: test
  target_iqn: iqn.2003-01.org.vm-11-32.test
  target_dir: /home/target/test
  lun_count: 1
  lun_size: 30 GiB
Create target? [N/y]: y
Creating target directory '/home/target/test'
Creating target 'iqn.2003-01.org.vm-11-32.test'
Created target iqn.2003-01.org.vm-11-32.test.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
Setting permissions (any host can access this target)
Parameter authentication is now '0'.
Parameter demo_mode_write_protect is now '0'.
Parameter generate_node_acls is now '1'.
Parameter cache_dynamic_acls is now '1'.
Creating disks
Creating backing file '/home/target/test/00'
Creating backing store '/backstores/fileio/test-00'
Created fileio test-00 with size 32212254720
Parameter emulate_tpu is now '1'.
Parameter emulate_tpws is now '1'.
Parameter max_write_same_len is now '65335'.
Adding lun for '/backstores/fileio/test-00'
Created LUN 0.
Saving configuration
Last 10 configs saved in /etc/target/backup/.
Configuration saved to /etc/target/saveconfig.json
Target added successfully

targetcli ls /iscsi/iqn.2003-01.org.vm-11-32.test
o- iqn.2003-01.org.vm-11-32.test ........................................... [TPGs: 1]
  o- tpg1 ......................................................... [gen-acls, no-auth]
    o- acls ................................................................ [ACLs: 0]
    o- luns ................................................................ [LUNs: 1]
    | o- lun0 ............. [fileio/test-00 (/home/target/test/00) (default_tg_pt_gp)]
    o- portals .......................................................... [Portals: 1]
      o- 0.0.0.0:3260 ........................................................... [OK]

targetcli ls /backstores/fileio/test-00
o- test-00 ..................... [/home/target/test/00 (30.0GiB) write-back activated]
  o- alua ........................................................... [ALUA Groups: 1]
    o- default_tg_pt_gp ............................... [ALUA state: Active/optimized]

However, the upload still fails on the timeout:

sudo python3 ./upload_disk.py /ova/rhl8.0.img --engine-url http://localhost:8080 --username admin@internal --sd-name SCSI -c /home/srosenbe/ovirt_engine_master/etc/pki/ovirt-engine/ca.pem
Checking image...
Image format: raw
Disk format: raw
Disk content type: data
Disk provisioned size: 10737418240
Disk initial size: 10737418240
Disk name: rhl8.0.raw
Disk backup: False
Connecting...
Password:
Creating disk...
Disk ID: c414bef8-91f3-4c5a-802e-aeb8d544d41f
Creating image transfer...
Transfer ID: ad395957-e500-4c96-8816-ba555605ac9a
Transfer host name: Host2
Uploading image...
[ 36.25% ] 3.62 GiB, 85.08 seconds, 43.63 MiB/s

Traceback (most recent call last):
  File "./upload_disk.py", line 323, in <module>
    **extra_args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 69, in upload
    io.copy(src, dst, buffer_size=buffer_size, progress=progress)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 38, in copy
    _copy_data(src, dst, buf, zero, progress)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 46, in _copy_data
    _copy_extent(src, dst, ext, buf, progress)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 67, in _copy_extent
    dst.read_from(src, length, buf)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 105, in read_from
    self._con.send(view[:n])
  File "/usr/lib64/python3.6/http/client.py", line 996, in send
    self.sock.sendall(data)
  File "/usr/lib64/python3.6/ssl.py", line 931, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib64/python3.6/ssl.py", line 900, in send
    return self._sslobj.write(data)
  File "/usr/lib64/python3.6/ssl.py", line 598, in write
    return self._sslobj.write(data)
socket.timeout: The write operation timed out

As per Comment 16, you are testing with a file in "format raw fedora-32.qcow2", while the file I am testing with is in the format:

Image format: raw
Disk format: raw

This is the snippet from Comment 16 of interest:

/upload_disk.py --engine-url https://engine3 --username admin@internal --password-file engine3-password --cafile engine3.pem --sd-name iscsi1 --disk-format raw fedora-32.qcow2
Checking image...
Image format: qcow2
Disk format: raw
Disk content type: data

Therefore, the issue doesn't seem to be the SCSI server but the file format itself; it seems we are not supporting the raw, block storage, non-sparse (Preallocated) combination as we should.

Please advise where you would like me to provide my file so that we can synchronize our testing. Thank you.
(In reply to Steven Rosenberg from comment #36)
> (In reply to Nir Soffer from comment #32)
> I was able to make the SCSI installation work with centos 8.2.

Great. centos 8.2 is basically the only version that matters now.

> However, the upload still fails on the timeout:
> sudo python3 ./upload_disk.py /ova/rhl8.0.img --engine-url http://localhost:8080
> --username admin@internal --sd-name SCSI ...

Are you using the new iscsi domain or the old domain? This looks like the same command used in comment 14.

...
> Traceback (most recent call last):
>   File "./upload_disk.py", line 323, in <module>
>     **extra_args)
...
>   File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 105, in read_from
>     self._con.send(view[:n])
>   File "/usr/lib64/python3.6/http/client.py", line 996, in send
...
>     return self._sslobj.write(data)
> socket.timeout: The write operation timed out

Looks like the host is not accessible and upload_disk falls back to using the proxy, or, more likely, you are using some old version of upload_disk.py that always used the proxy.

Please make sure you are using the latest version from ovirt-engine-sdk. Install the latest version of python3-ovirt-engine-sdk4; on my centos 8.2 I have now:

# rpm -q python3-ovirt-engine-sdk4
python3-ovirt-engine-sdk4-4.4.4-1.20200702git60cd533.el8.x86_64

Use upload_disk from:

/usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py

If the latest version of upload_disk.py fails in the same way, we need to check your storage and host.

> As per Comment 16, you are testing with a file in "format raw
> fedora-32.qcow2", while the file I am testing with is in the format:
>
> Image format: raw
> Disk format: raw
>
> This is the snippet from Comment 16 of interest:
...
> Image format: qcow2
> Disk format: raw

I already answered this: image format does not matter. client.upload() reads raw data from qemu-nbd, so uploading qcow2, compressed qcow2, or raw are all the same.
To understand why you have timeouts we need logs from the host doing the upload (host2 based on comment 36), but before we check the logs we need to be sure that your storage works. Please try this:

1. Test that you can create a preallocated 10g disk on your iscsi server:

   fallocate --length 10g /home/target/prealloc-test
   rm -f /home/target/prealloc-test

   Does it work?

2. Remove the misconfigured iscsi server that reproduced the timeout previously.

3. Reboot the host, since it may be in an unrecoverable state (multipath endless loop).

4. Test that you can write 10g of data to your 30g lun from the host.

   I assume you created a storage domain from the new 30g lun. Create a new 10g preallocated disk on this storage domain. Find the disk uuid in engine UI.

   Find the logical volume on the host using:

   lvs --config 'devices { filter = ["a|.*|"] }' -o vg_name,lv_name,tags | grep disk-id

   This will print the vg_name and lv_name. Example on my setup:

   # lvs --config 'devices { filter = ["a|.*|"] }' -o vg_name,lv_name,tags | grep ffca566d-03f2-466d-b1d7-c115928dd8dc
   0ce9ba3e-68c0-41fa-888e-75ca36ca6452 f7bd400e-c166-4543-8882-78f61a357f61 IU_ffca566d-03f2-466d-b1d7-c115928dd8dc,MD_1,PU_00000000-0000-0000-0000-000000000000

   Activate the logical volume:

   lvchange --config 'devices { filter = ["a|.*|"] }' -ay {vg_name}/{lv_name}

   Example on my setup:

   # lvchange --config 'devices { filter = ["a|.*|"] }' -ay 0ce9ba3e-68c0-41fa-888e-75ca36ca6452/f7bd400e-c166-4543-8882-78f61a357f61

   # ls -lh /dev/0ce9ba3e-68c0-41fa-888e-75ca36ca6452/f7bd400e-c166-4543-8882-78f61a357f61
   lrwxrwxrwx. 1 root root 8 Jul 5 22:31 /dev/0ce9ba3e-68c0-41fa-888e-75ca36ca6452/f7bd400e-c166-4543-8882-78f61a357f61 -> ../dm-26

   Now try to write 10g of data to this volume:

   dd if=/dev/zero \
       bs=8M \
       count=1280 \
       of=/dev/{vg_name}/{lv_name} \
       oflag=direct \
       conv=fsync \
       status=progress

   Example on my setup:

   # dd if=/dev/zero bs=8M count=1280 of=/dev/0ce9ba3e-68c0-41fa-888e-75ca36ca6452/f7bd400e-c166-4543-8882-78f61a357f61 oflag=direct conv=fsync status=progress
   10712252416 bytes (11 GB, 10 GiB) copied, 46 s, 232 MB/s
   1280+0 records in
   1280+0 records out
   10737418240 bytes (11 GB, 10 GiB) copied, 50.1554 s, 214 MB/s

   Does it work?

5. Test that you can zero the volume with blkdiscard:

   time blkdiscard --zeroout --step 1g /dev/{vg_name}/{lv_name}

   Example on my setup:

   # time blkdiscard --zeroout --step 1g /dev/0ce9ba3e-68c0-41fa-888e-75ca36ca6452/f7bd400e-c166-4543-8882-78f61a357f61

   real    0m36.467s
   user    0m0.000s
   sys     0m0.004s

   Does it work?

6. Test that you can zero the volume with fallocate:

   time fallocate --punch-hole --length 10g /dev/{vg_name}/{lv_name}

   # time fallocate --punch-hole --length 10g /dev/0ce9ba3e-68c0-41fa-888e-75ca36ca6452/f7bd400e-c166-4543-8882-78f61a357f61

   real    0m42.005s
   user    0m0.001s
   sys     0m0.003s

   Does it work?

7. Test that you can copy an image using qemu-img convert.

   First create a test image:

   LIBGUESTFS_BACKEND=direct virt-builder fedora-31 -o /var/tmp/test.img --size 10G

   Example on my setup:

   # LIBGUESTFS_BACKEND=direct virt-builder fedora-31 -o /var/tmp/test.img --size 10G
   [   1.3] Downloading: http://builder.libguestfs.org/fedora-31.xz
   [   1.9] Planning how to build this image
   [   1.9] Uncompressing
   [  11.9] Resizing (using virt-resize) to expand the disk to 10.0G
   [  51.1] Opening the new disk
   [  57.5] Setting a random seed
   [  57.5] Setting passwords
   virt-builder: Setting random password of root to 4dIgv5A795oryz41
   [  58.8] Finishing off
   Output file: /var/tmp/test.img
   Output size: 10.0G
   Output format: raw
   Total usable space: 9.4G
   Free space: 8.1G (86%)

   Try to copy it using qemu-img convert:

   time qemu-img convert \
       -p \
       -f raw \
       -O raw \
       -t none \
       -T none \
       -W \
       /var/tmp/test.img \
       /dev/{vg_name}/{lv_name}

   Example on my setup:

   # time qemu-img convert -p -f raw -O raw -t none -T none -W /var/tmp/test.img /dev/0ce9ba3e-68c0-41fa-888e-75ca36ca6452/f7bd400e-c166-4543-8882-78f61a357f61
       (100.00/100%)

   real    0m54.200s
   user    0m0.128s
   sys     0m0.247s

   Does it work?

8. Test that you can copy the image using qemu-nbd (this is how ovirt-imageio writes data to storage).

   Start qemu-nbd in one shell:

   qemu-nbd --socket=/tmp/nbd.sock \
       --persistent \
       --shared=8 \
       --format=raw \
       --aio=native \
       --cache=none \
       /dev/{vg_name}/{lv_name}

   In another shell copy the test image to the qemu-nbd socket:

   time qemu-img convert \
       -p \
       -f raw \
       -O raw \
       -t none \
       -T none \
       -W \
       -n \
       /var/tmp/test.img \
       nbd:unix:/tmp/nbd.sock

   Example on my setup:

   # time qemu-img convert -p -f raw -O raw -t none -T none -W -n /var/tmp/test.img nbd:unix:/tmp/nbd.sock
       (100.00/100%)

   real    0m48.004s
   user    0m0.185s
   sys     0m0.363s

   Does it work?

9. Finally, deactivate the logical volume:

   lvchange --config 'devices { filter = ["a|.*|"] }' -an {vg_name}/{lv_name}

10. Delete the disk in engine.
If all the tests above work, and upload still fails, the issue may be in imageio, and I will look at it.
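Conceptually, every copy path exercised above (qemu-img convert, qemu-nbd, and imageio's io.copy seen in the earlier traceback) reduces to a chunked read/write loop with progress reporting. A self-contained sketch of that pattern (illustrative only, not the actual imageio code; names are made up):

```python
import os
import tempfile

def copy_raw(src_path, dst_path, buf_size=8 * 1024 * 1024, progress=None):
    """Copy src to dst in buf_size chunks, reporting (done, total) progress."""
    total = os.path.getsize(src_path)
    done = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            buf = src.read(buf_size)
            if not buf:
                break
            dst.write(buf)
            done += len(buf)
            if progress:
                progress(done, total)
    return done

# Usage: copy a 1 MiB scratch file in 256 KiB chunks, collecting progress
# callbacks (four chunks, so four updates).
updates = []
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "src.img")
    dst = os.path.join(d, "dst.img")
    with open(src, "wb") as f:
        f.write(b"\0" * (1024 * 1024))
    copied = copy_raw(src, dst, buf_size=256 * 1024,
                      progress=lambda done, total: updates.append((done, total)))
```

The failure mode in this bug is a single write in such a loop blocking indefinitely; the loop itself has no timeout, which is why the caller hangs rather than failing.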
(In reply to Nir Soffer from comment #37)
> (In reply to Steven Rosenberg from comment #36)
> > I was able to make the SCSI installation work with centos 8.2.
>
> Great. centos 8.2 is basically the only version that matters now.
>
> Are you using the new iscsi domain or the old domain?

Of course the new one; the old one no longer exists.

> Please make sure you are using the latest version from ovirt-engine-sdk.

I re-based the sdk and rebuilt it before the testing.

> To understand why you have timeouts we need logs from the host doing
> the upload (host2 based on comment 36), but before we check the logs
> we need to be sure that your storage works.

It is the storage you recommended as per my previous comments. To save time, please provide me with your image file that you are uploading so that we can synchronize, before I start debugging the environment further. Thank you.
(In reply to Steven Rosenberg from comment #38)
> > To understand why you have timeouts we need logs from the host doing
> > the upload (host2 based on comment 36), but before we check the logs
> > we need to be sure that your storage works.
>
> It is the storage you recommended as per my previous comments. To save time,
> please provide me with your image file that you are uploading so that we can
> synchronize, before I start debugging the environment further. Thank you.

I'm testing with an image created by:

virt-builder fedora-31 -o fedora-31.raw --size 10G
This issue was addressed by updating dependencies, running only on the newest versions of the OS, and following the additional steps from Comment 31 for creating the SCSI storage domain. This includes adding an additional command in order to provision the proper amount of storage [1]. Better error handling and diagnostics within the system as a whole will be reported as separate issues.

[1] contrib/target-tools/create-target --lun-size 20 --lun-count 1 --root-dir /target scsi
The issue was clearly user error. Steven created a target on a file system with 3.6g free space, and then tried to upload a 10g preallocated disk. This cannot work even if we try very hard.
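The missing sanity check can be sketched in a few lines (a hypothetical helper, not engine or imageio code): a fully preallocated disk can never fit on a file system with less free space than its provisioned size, which is exactly the 10g-disk-on-3.6g-free situation described above.

```python
import shutil

GiB = 1024 ** 3

def can_preallocate(provisioned_size, free_bytes):
    """True if a fully preallocated disk of provisioned_size can fit."""
    return free_bytes >= provisioned_size

def target_has_room(directory, provisioned_size):
    """Check a target directory (e.g. a fileio backstore dir) directly."""
    return can_preallocate(provisioned_size, shutil.disk_usage(directory).free)

# The failing combination from this bug: 10 GiB preallocated, ~3.6 GiB free.
impossible = can_preallocate(10 * GiB, int(3.6 * GiB))
# The same disk on the 30 GiB lun with nothing else on it is fine.
possible = can_preallocate(10 * GiB, 30 * GiB)
```

Performing such a check up front would turn the observed timeout/hang into an immediate, diagnosable error.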