Bug 1425161 - Vm disk corrupted after virt-sparsify fails due to connection failure
Summary: Vm disk corrupted after virt-sparsify fails due to connection failure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.1.1
Target Release: 4.19.8
Assignee: Shmuel Melamud
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-20 16:59 UTC by sefi litmanovich
Modified: 2017-04-21 09:34 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:34:48 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments
engine, vdsm and supervdsm logs (1.03 MB, application/x-gzip)
2017-02-20 16:59 UTC, sefi litmanovich


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 734120 0 high CLOSED [RFE] use virt-sparsify to reduce image size 2021-02-22 00:41:40 UTC
oVirt gerrit 73288 0 master MERGED storage: Do not make volume ILLEGAL before sparsifying it 2017-03-02 13:06:55 UTC
oVirt gerrit 73466 0 ovirt-4.1 MERGED storage: Do not make volume ILLEGAL before sparsifying it 2017-03-08 09:16:01 UTC

Internal Links: 734120

Description sefi litmanovich 2017-02-20 16:59:31 UTC
Created attachment 1255829 [details]
engine, vdsm and supervdsm logs

Description of problem:
1. Created a VM with a thin provisioned disk on an iSCSI storage domain (over XtremIO).
2. Started the VM, created a large file on it and then removed the file.
3. Stopped the VM in order to invoke the sparsify action.
4. Around 2 seconds after invoking the sparsify action, I blocked the connection between the host and all iSCSI servers that provide the LUN from which the VM's disk was created, by running on the host:
iptables -A OUTPUT -d a.b.c.x,a.b.c.y,a.b.c.z -j DROP
The sparsify action failed as expected after its timeout was exceeded.
5. Then I removed the iptables rule and verified the connection was established again.
6. Then starting the VM failed. From engine.log:

2017-02-20 18:43:21,024+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] FINISH, CreateVDSCommand, log id: 33df812c
2017-02-20 18:43:21,030+02 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 77e3051e
2017-02-20 18:43:21,032+02 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] Lock freed to object 'EngineLock:{exclusiveLocks='[d693a7cd-212f-43ac-b447-e94c368bf2ef=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-02-20 18:43:21,073+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-33) [cbecf807-bc22-4f91-ba81-61e1f69de20d] EVENT_ID: USER_STARTED_VM(153), Correlation ID: cbecf807-bc22-4f91-ba81-61e1f69de20d, Job ID: 0e21ec01-e1a2-4952-bca6-3ae9604edaa2, Call Stack: null, Custom Event ID: -1, Message: VM sparsify-again was started by admin@internal-authz (Host: host_mixed_1).
2017-02-20 18:43:23,132+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef' was reported as Down on VDS '88152f46-1930-4d65-b393-bbbe5a90a2cc'(host_mixed_1)
2017-02-20 18:43:23,134+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-3) [] START, DestroyVDSCommand(HostName = host_mixed_1, DestroyVmVDSCommandParameters:{runAsync='true', hostId='88152f46-1930-4d65-b393-bbbe5a90a2cc', vmId='d693a7cd-212f-43ac-b447-e94c368bf2ef', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 31946edc
2017-02-20 18:43:24,150+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-3) [] FINISH, DestroyVDSCommand, log id: 31946edc
2017-02-20 18:43:24,151+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef'(sparsify-again) moved from 'WaitForLaunch' --> 'Down'
2017-02-20 18:43:24,198+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-3) [] EVENT_ID: VM_DOWN_ERROR(119), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM sparsify-again is down with error. Exit message: Bad volume specification {'index': '0', u'domainID': u'1699052c-faf9-47b8-9f2a-04f4c9c67a51', 'reqsize': '0', u'format': u'cow', u'optional': u'false', u'address': {u'function': u'0x0', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'slot': u'0x07'}, u'volumeID': u'9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0', 'apparentsize': '2147483648', u'imageID': u'25caa14f-81b0-4f83-9274-1aec9eb4e09c', u'discard': False, u'specParams': {}, u'readonly': u'false', u'iface': u'virtio', u'deviceId': u'25caa14f-81b0-4f83-9274-1aec9eb4e09c', 'truesize': '2147483648', u'poolID': u'0f761718-c078-40c9-835a-3485109402fd', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.
2017-02-20 18:43:24,199+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] add VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef'(sparsify-again) to rerun treatment
2017-02-20 18:43:24,221+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-3) [] Rerun VM 'd693a7cd-212f-43ac-b447-e94c368bf2ef'. Called from VDS 'host_mixed_1'
2017-02-20 18:43:24,240+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-41) [] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Correlation ID: cbecf807-bc22-4f91-ba81-61e1f69de20d, Job ID: 0e21ec01-e1a2-4952-bca6-3ae9604edaa2, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM sparsify-again on Host host_mixed_1.



Version-Release number of selected component (if applicable):
rhevm-4.1.1.2-0.1.el7.noarch

How reproducible:
Was able to reproduce it several times.

Steps to Reproduce:
Steps are clear from description.

Actual results:
Failed to start the VM after the sparsify action failed.

Expected results:
The VM should start even though the sparsify action failed.

Additional info:

Comment 1 Michal Skrivanek 2017-02-20 17:16:54 UTC
failed in vdsm on:

2017-02-20 18:43:20,018+0200 INFO  (vm/d693a7cd) [storage.TaskManager.Task (Task='89415bc5-8040-4294-a014-80655c17037e') aborting: Task is aborted: 'Cannot prepare illegal volume' - code 227 (task:1175)

Comment 2 Shmuel Melamud 2017-02-24 01:38:28 UTC
In the test, the storage domain connection was broken after the volume was successfully prepared, but before it was torn down.

When the volume was prepared, it was marked as illegal (see vdsm.log):

2017-02-20 18:42:17,415+0200 INFO  (tasks/6) [storage.VolumeManifest] sdUUID=1699052c-faf9-47b8-9f2a-04f4c9c67a51 imgUUID=25caa14f-81b0-4f83-9274-1aec9eb4e09c volUUID = 9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0 legality = ILLEGAL  (volume:393)

After the sparsify job failed, we tried to tear down the volume, but it also failed:

2017-02-20 18:42:45,679+0200 WARN  (tasks/6) [storage.LVM] 9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0 can't be reloaded, please check your storage connections. (lvm:81)
2017-02-20 18:42:45,679+0200 ERROR (tasks/6) [storage.guarded] Error releasing locks (guarded:116)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 111, in __exit__
    self._release()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 128, in _release
    raise ReleaseError(errors)
ReleaseError: [AttributeError('Failed reload: 9c7f8720-b6b2-49d0-a08b-4d2d7b22d0c0',)]

As a result, the volume was left marked as illegal.

virt-sparsify does not leave the volume corrupted, even if sparsification is aborted in the middle. Thus, the volume may safely be made legal again. The problem is to determine the correct moment to do it (when the connection to the SD is restored) and to make sure that the disk was marked illegal as a result of the sparsify operation and not for some other reason. I propose leaving a periodic task on the Engine side if SparsifyImageCommand finished with a failure and the disk cannot be reached. This task will poll the storage domain periodically, and when the disk is reachable again, it will mark it as legal.
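For illustration only, a minimal sketch of the proposed recovery flow (the real Engine is written in Java; is_storage_domain_reachable() and set_volume_legal() are hypothetical callbacks, not existing oVirt APIs):

import time

POLL_INTERVAL = 60  # seconds between reachability checks

def recover_volume_legality(storage_domain, volume_id,
                            is_storage_domain_reachable, set_volume_legal):
    """Poll until the storage domain is reachable again, then flip the
    volume back to LEGAL. Persisting this task on the Engine side (in the
    DB) survives host restarts, unlike a job kept alive inside VDSM."""
    while not is_storage_domain_reachable(storage_domain):
        time.sleep(POLL_INTERVAL)
    set_volume_legal(storage_domain, volume_id)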

Working on the correct implementation of this.

Comment 3 Michal Skrivanek 2017-02-27 09:09:06 UTC
can't vdsm set it legal again when it fails?

Comment 4 Shmuel Melamud 2017-02-27 09:53:06 UTC
(In reply to Michal Skrivanek from comment #3)
> can't vdsm set it legal again when it fails?

No, not at that moment, because the storage is still unreachable.

I thought about leaving the VDSM job running in this case and trying to set the volume legal periodically. But if VDSM dies in the meantime (and this is not unlikely), the volume will be left illegal. Leaving an Engine job and persisting it to the DB is more reliable.

Comment 5 Michal Skrivanek 2017-02-27 11:15:39 UTC
That doesn't sound like a great solution either way. How about a manual "make legal" button under disks? Please consult with storage.

Comment 6 Shmuel Melamud 2017-02-28 11:35:45 UTC
Richard says that virt-sparsify --in-place will never leave the volume in a corrupted state. Nir says that in this case we can avoid marking the volume illegal at the beginning. The lock will be taken in any case, but a stale lock will be freed automatically after 80 seconds and the disk will be accessible.

The SDM.amend_volume verb works this way. To operate similarly, I need to avoid using CopyDataDIVEndpoint in the sparsify code and to use the amend_volume approach.
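For illustration, a minimal sketch of the difference between the two flows; prepare_volume(), set_legality(), run_virt_sparsify() and teardown_volume() are placeholders invented for this sketch, not real VDSM calls:

ILLEGAL, LEGAL = "ILLEGAL", "LEGAL"

def sparsify_old_flow(volume, prepare_volume, set_legality,
                      run_virt_sparsify, teardown_volume):
    # Old flow: the volume is flipped to ILLEGAL when it is prepared and is
    # only flipped back to LEGAL after a successful teardown. If the storage
    # connection is lost, teardown raises, the LEGAL step is never reached,
    # and the VM can no longer start.
    set_legality(volume, ILLEGAL)
    prepare_volume(volume)
    try:
        run_virt_sparsify(volume)
    finally:
        teardown_volume(volume)       # raises if storage is unreachable
        set_legality(volume, LEGAL)   # never reached on teardown failure

def sparsify_new_flow(volume, prepare_volume,
                      run_virt_sparsify, teardown_volume):
    # New flow (amend_volume style): legality is never touched, since
    # virt-sparsify --in-place cannot corrupt the image. Only the volume
    # lease protects against concurrent access, and a stale lease expires
    # on its own (~80 seconds), so an aborted job leaves the disk usable.
    prepare_volume(volume)
    try:
        run_virt_sparsify(volume)
    finally:
        teardown_volume(volume)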

Comment 7 sefi litmanovich 2017-03-05 15:50:59 UTC
(In reply to Shmuel Melamud from comment #6)
> Richard says that virt-sparsify --in-place will never leave the volume in
> a corrupted state. Nir says that in this case we can avoid marking the
> volume illegal at the beginning. The lock will be taken in any case, but a
> stale lock will be freed automatically after 80 seconds and the disk will
> be accessible.
> 
> The SDM.amend_volume verb works this way. To operate similarly, I need to
> avoid using CopyDataDIVEndpoint in the sparsify code and to use the
> amend_volume approach.

Just to be sure, I reproduced the issue again, this time waiting much longer than 80 seconds, but the issue still occurs with the same errors.

Comment 8 Shmuel Melamud 2017-03-05 15:58:38 UTC
(In reply to sefi litmanovich from comment #7)
Thanks, so the issue is caused by making the disk illegal before running virt-sparsify, which is not needed. I've posted the patch.

Comment 11 sefi litmanovich 2017-03-19 13:26:29 UTC
Verified with rhevm-4.1.1.5-0.1.el7 according to the steps in the description.
Waited more than 80 seconds after the sparsification failure and after restarting iptables. Starting the VM worked as expected; the VM disk wasn't corrupted and booted fine. Also ran fsck -n to try to catch any problems with the VM's disk.
The sparsify action worked fine after stopping the VM again, without network interruption.
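For reference, a minimal sketch of the read-only filesystem check used during verification, assuming the guest's root device is /dev/vda1 (a hypothetical path; adjust for the actual guest disk):

import subprocess

def filesystem_is_clean(device="/dev/vda1"):
    # fsck -n opens the filesystem read-only and reports problems without
    # fixing anything; a non-zero exit status means errors were found.
    return subprocess.call(["fsck", "-n", device]) == 0

if __name__ == "__main__":
    print("clean" if filesystem_is_clean() else "problems found")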

