Bug 1426136 - Restarting vdsmd service on HSM host that copies data for cloning VM from template causes the LV container not to be removed
Summary: Restarting vdsmd service on HSM host that copies data for cloning VM from tem...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.1.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: 4.1.1.5
Assignee: Liron Aravot
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-23 09:35 UTC by Raz Tamir
Modified: 2017-04-21 09:49 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:49:34 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments (Terms of Use)
engine and vdsm logs (852.44 KB, application/x-gzip)
2017-02-23 09:35 UTC, Raz Tamir
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 74010 0 master MERGED core: vm from template - revert flow 2017-03-14 13:40:16 UTC
oVirt gerrit 74064 0 ovirt-engine-4.1 MERGED core: vm from template - revert flow 2017-03-14 16:59:00 UTC
oVirt gerrit 74078 0 ovirt-engine-4.1.1.z MERGED core: vm from template - revert flow 2017-03-14 16:59:39 UTC

Description Raz Tamir 2017-02-23 09:35:04 UTC
Created attachment 1256841 [details]
engine and vdsm logs

Description of problem:
As part of the HSM testing for the clone VM from template verb: when restarting the vdsmd service on the HSM that executes the copy data operation, the entire flow rolls back - the VM is not created (expected), but the container LV that was created to hold the data is not removed:

1) Create the volume container:
2017-02-23 11:23:31,677+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (default task-29) [ee204ad] Running command: CreateVolumeContainerCommand internal: true.
2017-02-23 11:23:31,729+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (default task-29) [ee204ad] START, CreateSnapshotVDSCommand( CreateSnapshotVDSCommandParameters:{runAsync='true', storagePoolId='6708e4f5-b4c9-40de-84d1-43b9e8ef31a5', ignoreFailoverLimit='false', storageDomainId='0369f194-ca21-4cbf-ad28-89380ceaf592', imageGroupId='f620853d-c8af-48a6-902c-2100d1396df3', imageSizeInBytes='10737418240', volumeFormat='COW', newImageId='f957c0d7-da38-4534-acf7-a62ea98cdab1', newImageDescription='null', imageInitialSizeInBytes='9761289310', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000'}), log id: 41db4584

2) Copy the data into the new volume created in 1:
2017-02-23 11:23:40,647+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CopyVolumeDataVDSCommand] (DefaultQuartzScheduler6) [48dd6306] START, CopyVolumeDataVDSCommand(HostName = host_mixed_1, CopyVolumeDataVDSCommandParameters:{runAsync='true', hostId='18773e9a-9a53-4384-bb9a-fe3bada6c6ad', storageDomainId='null', jobId='f9ed6e80-c209-46eb-b520-49d9ac47b504', srcInfo='VdsmImageLocationInfo [storageDomainId=0369f194-ca21-4cbf-ad28-89380ceaf592, imageGroupId=f82c1509-e3ae-4ad0-a2c3-720d00080a06, imageId=a5ae9798-e9c4-4a80-83ba-a95bf3e2e0a4, generation=null]', dstInfo='VdsmImageLocationInfo [storageDomainId=0369f194-ca21-4cbf-ad28-89380ceaf592, imageGroupId=f620853d-c8af-48a6-902c-2100d1396df3, imageId=f957c0d7-da38-4534-acf7-a62ea98cdab1, generation=0]', collapse='true'}), log id: 33fda667


*** newImageId='f957c0d7-da38-4534-acf7-a62ea98cdab1'

When running lvs on the HSM host after the operation has finished (failed), I see:
  LV                                   VG                                   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  7f5e4739-a25e-423a-abc4-4a61930a26d6 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi------- 128.00m                                                    
  a5ae9798-e9c4-4a80-83ba-a95bf3e2e0a4 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-------   5.50g                                                    
  e0b778bc-459d-4288-81b3-1f757212f292 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi------- 128.00m                                                    
  f957c0d7-da38-4534-acf7-a62ea98cdab1 0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-------  10.00g                                                    
  ids                                  0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-ao---- 128.00m                                                    
  inbox                                0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a----- 128.00m                                                    
  leases                               0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a-----   2.00g                                                    
  master                               0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-ao----   1.00g                                                    
  metadata                             0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a----- 512.00m                                                    
  outbox                               0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a----- 128.00m                                                    
  xleases                              0369f194-ca21-4cbf-ad28-89380ceaf592 -wi-a-----   1.00g                                                    
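
For reference, a quick way to re-check just the leftover LV on the host (a minimal sketch; the VG and LV names below are the storage domain UUID and the newImageId taken from the logs above):

  # non-empty output means the container LV created for the clone was left behind
  lvs --noheadings -o lv_name,lv_size 0369f194-ca21-4cbf-ad28-89380ceaf592/f957c0d7-da38-4534-acf7-a62ea98cdab1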



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Red Hat Bugzilla Rules Engine 2017-02-26 13:12:36 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Liron Aravot 2017-03-12 15:46:49 UTC
Raz, what's the vdsm version that was in use?
What's the exact lvs command you ran? Was it just 'lvs'?

After nsoffer's patches that disable lvmetad, we shouldn't see LVs that were already deleted; regardless, having those LVs listed there shouldn't have any effect, AFAIK.

Nir, if you can - please take a look as well.

Comment 3 Raz Tamir 2017-03-12 15:54:49 UTC
Liron,
Sorry for not filling this info in the description:

Version-Release number of selected component (if applicable):
vdsm-4.19.7-1.el7ev.x86_64
rhevm-4.1.1.4-0.1.el7

How reproducible:
100%


Steps to Reproduce:
1. Create a VM from a template (clone)
2. Kill the vdsmd service right after 'CopyVolumeDataVDSCommand' appears in engine.log (see the sketch below)
3. Run the # lvs command on the SPM and check that there is one extra LV.

** In some cases, the clone operation will succeed, so re-run until the clone fails
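
A rough sketch of the step 2 timing, run from the engine machine (assumes the default engine.log path; <hsm_host> is a placeholder for the HSM that executes the copy):

  # wait for the copy command to show up in engine.log, then immediately restart vdsm on the HSM
  # (bash process substitution; the leftover tail exits on its own once the pipe closes)
  grep -q 'CopyVolumeDataVDSCommand' <(tail -Fn0 /var/log/ovirt-engine/engine.log)
  ssh root@<hsm_host> 'systemctl restart vdsmd'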

Comment 4 Raz Tamir 2017-03-12 16:23:18 UTC
Update:
before step 3 run:
# pvscan --cache

Comment 5 Nir Soffer 2017-03-12 16:35:35 UTC
(In reply to Raz Tamir from comment #4)
> Update:
> before step 3 run:
> # pvscan --cache

pvscan --cache has not been needed since vdsm 4.19.5 and 4.18.23, because lvmetad is no longer used.
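
For completeness, one can check how lvmetad is configured on the host (a sketch, assuming the stock lvm.conf location and the RHEL 7 unit names; note that vdsm may also override this per lvm command, so the host-wide setting alone isn't the full story):

  # host-wide lvmetad configuration and service state
  grep use_lvmetad /etc/lvm/lvm.conf
  systemctl status lvm2-lvmetad.service lvm2-lvmetad.socket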

Comment 6 Nir Soffer 2017-03-12 16:50:19 UTC
(In reply to Liron Aravot from comment #2)
> Raz, what's the vdsm version that was in use?
> What's the exact lvs command you ran? was it 'lvs'?
> 
> After nsoffer's patches that disable lvmetad we shouldn't get lvs that were
> deleted, regardless - there shouldn't be an effect of having the lvs listed
> there afaik.
> 
> Nir, if you can - please take a look as well.

Disabling lvmetad is not related.

Comment 7 Nir Soffer 2017-03-12 16:55:31 UTC
(In reply to Raz Tamir from comment #3)
> Steps to Reproduce:
> 1. Create a VM from template (clone)
> 2. Kill vdsmd service right after 'CopyVolumeDataVDSCommand' regex in
> engine.log
> 3. Run # lvs command on the SPM and check that there is 1 more LV.

Raz, when vdsm disconnects from the engine, the engine will try to check what happened
to the copy job, and if the job failed, the engine should delete the unneeded LV.

If you run lvs on this cluster while the engine is recovering from the network
error, you will still find the LV on storage.

This process takes time; are you sure you waited until the engine finished handling
the copy operation before you checked whether the LV exists?
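
A sketch of what I mean by "finished handling": on the engine machine, the copy command should have logged its completion (or failure) by then, e.g. (assuming the default engine.log path):

  # the last CopyVolumeDataVDSCommand entries should show the command finishing, not only START
  grep 'CopyVolumeDataVDSCommand' /var/log/ovirt-engine/engine.log | tail -n 5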

Comment 8 Raz Tamir 2017-03-12 17:07:14 UTC
Hi Nir,
The LVs are not deleted at all (checked after a few days (: ).

I'm running lvs after the environment has recovered, the HSM is running again, and the operation has finished (failed).

Comment 9 Liron Aravot 2017-03-12 17:32:37 UTC
Thanks Nir/Raz -
The problem is that we don't have a revert flow for that operation - the engine doesn't attempt to delete the created images, and from an initial look it seems like it never did.
Raz - can you please test on an earlier version to see if it's a regression?

I'm working on a fix.

Comment 10 Raz Tamir 2017-03-12 18:32:03 UTC
Removing the Regression flag.
Thanks Liron

Comment 11 Raz Tamir 2017-03-19 08:14:04 UTC
Verified on rhevm-4.1.1.5-0.1.el7
All LVs removed

