Created attachment 1256841 [details]
engine and vdsm logs

Description of problem:
As part of the HSM testing for the clone VM from template verb, when restarting the vdsmd service on the HSM that executes the copy data operation, the entire flow rolls back: the VM is not created (expected), but the container LV that was created to hold the data is not removed:

1) Create the volume container:
2017-02-23 11:23:31,677+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (default task-29) [ee204ad] Running command: CreateVolumeContainerCommand internal: true.
2017-02-23 11:23:31,729+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (default task-29) [ee204ad] START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{runAsync='true', storagePoolId='6708e4f5-b4c9-40de-84d1-43b9e8ef31a5', ignoreFailoverLimit='false', storageDomainId='0369f194-ca21-4cbf-ad28-89380ceaf592', imageGroupId='f620853d-c8af-48a6-902c-2100d1396df3', imageSizeInBytes='10737418240', volumeFormat='COW', newImageId='f957c0d7-da38-4534-acf7-a62ea98cdab1', newImageDescription='null', imageInitialSizeInBytes='9761289310', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000'}), log id: 41db4584

2) Copy the data to the new volume from 1:
2017-02-23 11:23:40,647+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CopyVolumeDataVDSCommand] (DefaultQuartzScheduler6) [48dd6306] START, CopyVolumeDataVDSCommand(HostName = host_mixed_1, CopyVolumeDataVDSCommandParameters:{runAsync='true', hostId='18773e9a-9a53-4384-bb9a-fe3bada6c6ad', storageDomainId='null', jobId='f9ed6e80-c209-46eb-b520-49d9ac47b504', srcInfo='VdsmImageLocationInfo [storageDomainId=0369f194-ca21-4cbf-ad28-89380ceaf592, imageGroupId=f82c1509-e3ae-4ad0-a2c3-720d00080a06, imageId=a5ae9798-e9c4-4a80-83ba-a95bf3e2e0a4, generation=null]', dstInfo='VdsmImageLocationInfo [storageDomainId=0369f194-ca21-4cbf-ad28-89380ceaf592, imageGroupId=f620853d-c8af-48a6-902c-2100d1396df3, imageId=f957c0d7-da38-4534-acf7-a62ea98cdab1, generation=0]', collapse='true'}), log id: 33fda667

*** newImageId='f957c0d7-da38-4534-acf7-a62ea98cdab1'

When running lvs on the HSM host after the operation finished (failed) I see:

  LV                                   VG                                   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  7f5e4739-a25e-423a-abc4-4a61930a26d6 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi------- 128.00m
  a5ae9798-e9c4-4a80-83ba-a95bf3e2e0a4 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-------   5.50g
  e0b778bc-459d-4288-81b3-1f757212f292 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi------- 128.00m
  f957c0d7-da38-4534-acf7-a62ea98cdab1 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-------  10.00g
  ids                                  0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-ao---- 128.00m
  inbox                                0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a----- 128.00m
  leases                               0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a-----   2.00g
  master                               0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-ao----   1.00g
  metadata                             0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a----- 512.00m
  outbox                               0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a----- 128.00m
  xleases                              0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a-----   1.00g

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Raz, what's the vdsm version that was in use?
What's the exact lvs command you ran? Was it 'lvs'?

After nsoffer's patches that disable lvmetad we shouldn't get LVs that were deleted. Regardless, there shouldn't be any effect from having the LVs listed there, afaik.

Nir, if you can, please take a look as well.
Liron,
Sorry for not filling in this info in the description:

Version-Release number of selected component (if applicable):
vdsm-4.19.7-1.el7ev.x86_64
rhevm-4.1.1.4-0.1.el7

How reproducible:
100%

Steps to Reproduce:
1. Create a VM from template (clone)
2. Kill the vdsmd service right after the 'CopyVolumeDataVDSCommand' regex appears in engine.log
3. Run the # lvs command on the SPM and check that there is 1 more LV.

** In some cases the clone operation will succeed, so re-run until the clone fails
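The check in step 3 can be sketched as a comparison of two lvs listings, one taken before the clone and one after the failed operation. This is only an illustration: the helper names lv_names and leftover_lvs are hypothetical, and the "before" listing is an assumed, shortened sample (the report only shows the listing taken after the failure).

```python
VG = "0369f194-ca21-4cbf-ad28-89380ceaf592"


def lv_names(lvs_output, vg_name):
    """Return the set of LV names that plain `lvs` text output lists for vg_name."""
    names = set()
    for line in lvs_output.splitlines():
        fields = line.split()
        # Data rows look like "<LV> <VG> <Attr> <LSize> ..."; the header row
        # has the literal "VG" in the second column, so it never matches.
        if len(fields) >= 2 and fields[1] == vg_name:
            names.add(fields[0])
    return names


def leftover_lvs(before, after, vg_name):
    """LVs present in the `after` listing but not in the `before` listing."""
    return lv_names(after, vg_name) - lv_names(before, vg_name)


# Assumed sample taken before the clone (shortened, not from the report).
BEFORE = f"""\
  LV                                   VG   Attr       LSize
  a5ae9798-e9c4-4a80-83ba-a95bf3e2e0a4 {VG} -wi-------   5.50g
  ids                                  {VG} -wi-ao---- 128.00m
"""

# Shortened version of the listing from the description (after the failure).
AFTER = f"""\
  LV                                   VG   Attr       LSize
  a5ae9798-e9c4-4a80-83ba-a95bf3e2e0a4 {VG} -wi-------   5.50g
  f957c0d7-da38-4534-acf7-a62ea98cdab1 {VG} -wi-------  10.00g
  ids                                  {VG} -wi-ao---- 128.00m
"""

if __name__ == "__main__":
    print(leftover_lvs(BEFORE, AFTER, VG))
```

With these samples the only leftover LV is the newImageId from the description, f957c0d7-da38-4534-acf7-a62ea98cdab1.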
Update:
before step 3 run:
# pvscan --cache
(In reply to Raz Tamir from comment #4)
> Update:
> before step 3 run:
> # pvscan --cache

pvscan --cache is not needed since 4.19.5 and 4.18.23, since lvmetad is not used.
(In reply to Liron Aravot from comment #2)
> Raz, what's the vdsm version that was in use?
> What's the exact lvs command you ran? was it 'lvs'?
>
> After nsoffer's patches that disable lvmetad we shouldn't get lvs that were
> deleted, regardless - there shouldn't be an effect of having the lvs listed
> there afaik.
>
> Nir, if you can - please take a look as well.

Disabling lvmetad is not related.
(In reply to Raz Tamir from comment #3)
> Steps to Reproduce:
> 1. Create a VM from template (clone)
> 2. Kill vdsmd service right after 'CopyVolumeDataVDSCommand' regex in
> engine.log
> 3. Run # lvs command on the SPM and check that there is 1 more LV.

Raz, when vdsm disconnects from engine, engine will try to check what happened to the copy job, and if the job failed, engine should delete the unneeded LV.

If you run lvs on this cluster while engine is recovering from the network error, you will find the LV on storage.

This process takes time. Are you sure you waited until engine finished handling the copy operation before you checked whether the LV exists?
Hi Nir,
The LVs are not deleted at all (checked after a few days (: ).
I'm running lvs after the environment has recovered, the HSM is running again, and the operation has finished (failed).
Thanks Nir/Raz.
The problem is that we don't have a revert flow for that operation: the engine doesn't attempt to delete the created images, and from an initial look it seems like it never did.

Raz, can you please test on an earlier version to see if it's a regression?

I'm working on a fix.
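A missing revert flow of this kind can be illustrated with a small compensation pattern: create the container, attempt the copy, and delete the container again if the copy fails. This is a generic sketch and not the actual engine fix (the engine is written in Java); FakeStorage, clone_volume and their methods are hypothetical names used only for the illustration.

```python
class FakeStorage:
    """In-memory stand-in for the storage layer (hypothetical, illustration only)."""

    def __init__(self, fail_copy=False):
        self.volumes = set()
        self.fail_copy = fail_copy

    def create_volume(self, vol_id):
        self.volumes.add(vol_id)

    def copy_data(self, src_id, dst_id):
        # Simulates the copy job failing, e.g. because vdsmd was killed.
        if self.fail_copy:
            raise RuntimeError("copy job failed")

    def delete_volume(self, vol_id):
        self.volumes.discard(vol_id)


def clone_volume(storage, src_id, dst_id):
    """Create the destination container, copy the data, revert on failure."""
    storage.create_volume(dst_id)
    try:
        storage.copy_data(src_id, dst_id)
    except Exception:
        # Revert step: remove the container so no leftover LV stays on storage.
        storage.delete_volume(dst_id)
        raise


if __name__ == "__main__":
    storage = FakeStorage(fail_copy=True)
    try:
        clone_volume(storage, "src", "dst")
    except RuntimeError as err:
        print("clone failed:", err)
    print("volumes left on storage:", storage.volumes)
```

In the real flow the failure is only known after engine reconnects to the host and checks the job status, so the revert would run at that point rather than in a simple try/except.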
Removing the Regression flag. Thanks Liron
Verified on rhevm-4.1.1.5-0.1.el7.
All LVs removed.