Created attachment 1255829 [details]
engine, vdsm and supervdsm logs

Description of problem:

1. Created a VM with a thin provisioned disk on an iSCSI storage domain (over XtremIO).
2. Started the VM, created a large file on it and then removed the file.
3. Stopped the VM in order to invoke the sparsify action.
4. About 2 seconds after invoking the sparsify action, blocked the connection between the host and all the iSCSI servers that provide the LUN from which the VM's disk was created, by running on the host:
   iptables -A OUTPUT -d a.b.c.x,a.b.c.y,a.b.c.z -j DROP
   The sparsify action failed, as expected, after its timeout was exceeded.
5. Removed the iptables rule and verified that the connection was re-established.
6. Starting the VM then failed.

From engine.log:

2017-02-20 18:43:21,024+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] FINISH, CreateVDSCommand, log id: 33df812c
2017-02-20 18:43:21,030+02 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 77e3051e
2017-02-20 18:43:21,032+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] Lock freed to object 'EngineLock:{exclusiveLocks='[d693a7cd-212f-43ac-b447-e94c368bf2ef=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-02-20 18:43:21,073+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] EVENT_ID: USER_STARTED_VM(153), Correlation ID: cbecf807-bc22-4f91-ba81-61e1f69de20d, Job ID: 0e21ec01-e1a2-4952-bca6-3ae9604edaa2, Call Stack: null, Custom Event ID: -1, Message: VM sparsify-again was started by admin@internal-authz (Host: host_mixed_1).
2017-02-20 18:43:23,132+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef' was reported as Down on VDS '88152f46-1930-4d65-b393-bbbe5a90a2cc'(host_mixed_1)
2017-02-20 18:43:23,134+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-3) [] START, DestroyVDSCommand(HostName = host_mixed_1, DestroyVmVDSCommandParameters:{runAsync='true', hostId='88152f46-1930-4d65-b393-bbbe5a90a2cc', vmId='d693a7cd-212f-43ac-b447-e94c368bf2ef', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 31946edc
2017-02-20 18:43:24,150+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-3) [] FINISH, DestroyVDSCommand, log id: 31946edc
2017-02-20 18:43:24,151+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef'(sparsify-again) moved from 'WaitForLaunch' --> 'Down'
2017-02-20 18:43:24,198+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-3) [] EVENT_ID: VM_DOWN_ERROR(119), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM sparsify-again is down with error.
Exit message: Bad volume specification {'index': '0', u'domainID': u'1699052c-faf9-47b8-9f2a-04f4c9c67a51', 'reqsize': '0', u'format': u'cow', u'optional': u'false', u'address': {u'function': u'0x0', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'slot': u'0x07'}, u'volumeID': u'9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0', 'apparentsize': '2147483648', u'imageID': u'25caa14f-81b0-4f83-9274-1aec9eb4e09c', u'discard': False, u'specParams': {}, u'readonly': u'false', u'iface': u'virtio', u'deviceId': u'25caa14f-81b0-4f83-9274-1aec9eb4e09c', 'truesize': '2147483648', u'poolID': u'0f761718-c078-40c9-835a-3485109402fd', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.
2017-02-20 18:43:24,199+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] add VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef'(sparsify-again) to rerun treatment
2017-02-20 18:43:24,221+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-3) [] Rerun VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef'. Called from VDS 'host_mixed_1'
2017-02-20 18:43:24,240+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-41) [] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Correlation ID: cbecf807-bc22-4f91-ba81-61e1f69de20d, Job ID: 0e21ec01-e1a2-4952-bca6-3ae9604edaa2, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM sparsify-again on Host host_mixed_1.

Version-Release number of selected component (if applicable):
rhevm-4.1.1.2-0.1.el7.noarch

How reproducible:
Reproduced several times.

Steps to Reproduce:
Steps are clear from the description.

Actual results:
Failed to start the VM after the sparsify action failed.

Expected results:
The VM should start even though the sparsify action failed.

Additional info:
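For reference, the connectivity break and restore from step 4 can be scripted. A minimal sketch (the portal addresses are the placeholders from the description; the helper name is mine, not a real tool):

```python
def iptables_cmd(action, portal_ips):
    """Build the iptables command used in step 4 to drop ('-A') or later
    restore ('-D') outgoing traffic to the iSCSI portals, matching:
    iptables -A OUTPUT -d a.b.c.x,a.b.c.y,a.b.c.z -j DROP
    """
    return ["iptables", action, "OUTPUT",
            "-d", ",".join(portal_ips), "-j", "DROP"]
```

Running the "-D" variant deletes the exact rule added by "-A", which is how step 5 re-establishes the connection.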
failed in vdsm on: 2017-02-20 18:43:20,018+0200 INFO (vm/d693a7cd) [storage.TaskManager.Task (Task='89415bc5-8040-4294-a014-80655c17037e') aborting: Task is aborted: 'Cannot prepare illegal volume' - code 227 (task:1175)
In the test, the storage domain connection break happened after the volume was successfully prepared, but before it was torn down. When the volume was prepared, it was marked as illegal (see vdsm.log):

2017-02-20 18:42:17,415+0200 INFO (tasks/6) [storage.VolumeManifest] sdUUID=1699052c-faf9-47b8-9f2a-04f4c9c67a51 imgUUID=25caa14f-81b0-4f83-9274-1aec9eb4e09c volUUID = 9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0 legality = ILLEGAL (volume:393)

After the sparsify job failed, we tried to tear down the volume, but that also failed:

2017-02-20 18:42:45,679+0200 WARN (tasks/6) [storage.LVM] 9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0 can't be reloaded, please check your storage connections. (lvm:81)
2017-02-20 18:42:45,679+0200 ERROR (tasks/6) [storage.guarded] Error releasing locks (guarded:116)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 111, in __exit__
    self._release()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 128, in _release
    raise ReleaseError(errors)
ReleaseError: [AttributeError('Failed reload: 9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0',)]

As a result, the volume was left marked as illegal.

virt-sparsify does not leave the volume corrupted, even if sparsification was aborted in the middle. Thus, the volume may safely be made legal again. The problem is to determine the correct moment to do so (when the connection to the SD is restored) and to make sure that the disk was marked illegal as a result of the sparsify operation and not for some other reason.

I propose leaving a periodic task on the Engine side if SparsifyImageCommand finishes with failure and the disk cannot be reached. This task will poll the storage domain periodically, and when the disk is reachable again, it will mark it as legal. Working on the correct implementation of this.
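The proposed Engine-side task would essentially be a persisted retry loop. A minimal sketch, where is_reachable() and mark_legal() are hypothetical stand-ins for the real storage domain reachability check and the call that flips the volume back to LEGAL:

```python
import time

def mark_legal_when_reachable(is_reachable, mark_legal,
                              attempts=10, delay=30.0):
    """Poll the storage domain; once the disk is reachable again,
    flip it back to LEGAL. Returns True if the volume was made
    legal, False if we gave up after the configured attempts."""
    for _ in range(attempts):
        if is_reachable():
            mark_legal()
            return True
        time.sleep(delay)
    return False
```

The real task would additionally have to verify that the ILLEGAL flag was set by this sparsify operation and not for some other reason, and (as discussed below in the thread) survive an Engine restart by being persisted to the DB.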
can't vdsm set it legal again when it fails?
(In reply to Michal Skrivanek from comment #3)
> can't vdsm set it legal again when it fails?

No, not at that moment, because the storage is still unreachable. I thought about leaving the VDSM job running in this case and trying to set the volume legal periodically. But if VDSM dies in the meantime (and this is not unlikely), the volume will be left illegal. Leaving an Engine job and persisting it to the DB is more reliable.
That doesn't sound like a great solution either way. How about a manual "make legal" button under disks? Please consult with storage.
Richard says that virt-sparsify --inplace will never leave the volume in a corrupted state. Nir says that in this case we can avoid marking the volume illegal at the beginning. The lock will be taken in any case, but a stale lock is freed automatically after 80 seconds, and the disk will be accessible again.

The SDM.amend_volume verb already works this way. To operate similarly, I need to avoid using CopyDataDIVEndpoint in the sparsify code and use the amend_volume approach instead.
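To illustrate the difference, here is a toy model of the proposed flow (all names are mine, this is not vdsm code): legality is never toggled around the operation, so even a sparsify aborted by a storage outage leaves the volume in its LEGAL state and the VM can start once connectivity returns.

```python
LEGAL, ILLEGAL = "LEGAL", "ILLEGAL"

class Volume(object):
    def __init__(self):
        self.legality = LEGAL

def sparsify_inplace(vol, run_sparsify):
    """Proposed flow: do NOT mark the volume ILLEGAL up front, since
    virt-sparsify --inplace never corrupts it. Only the (implicit)
    lock protects the volume; a stale lock expires on its own."""
    try:
        run_sparsify(vol)
    except Exception:
        # Even on failure the volume keeps whatever legality it had,
        # so a later VM start is not blocked by a stale ILLEGAL flag.
        return False
    return True
```

Contrast with the old CopyDataDIVEndpoint-style flow, where the volume was flipped to ILLEGAL before the operation and a failed teardown (as in this bug) left it that way permanently.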
(In reply to Shmuel Melamud from comment #6)
> Richard says that virt-sparsify --inplace will never leave the volume in
> corrupted state. Nir says that in this case we are able not to mark the
> volume illegal at the beginning. The lock will be taken in any case, but
> stale lock will be freed automatically after 80 seconds and the disk will be
> accessible.
>
> SDM.amend_volume verb works this way. To operate similarly, I need to avoid
> using CopyDataDIVEndpoint in sparsify code and to use amend_volume approach.

Just to be sure, I reproduced the issue again, this time waiting much longer than 80 seconds, but the issue still occurs with the same errors.
(In reply to sefi litmanovich from comment #7) Thanks, so the issue is caused by making the disk illegal before running virt-sparsify, which is not needed. I've posted the patch.
Verified with rhevm-4.1.1.5-0.1.el7 according to the steps in the description. Waited more than 80 seconds after the sparsification failure and after restarting iptables. Starting the VM worked as expected; the VM's disk wasn't corrupted and the VM booted fine. Also ran fsck -n to try to catch any problems with the VM's disk. The sparsify action worked fine after stopping the VM again, without a network interruption.