Bug 1124498

Summary: Corrupted volume chain with a cycle, leading to endless loop in Image.getChain
Product: [Retired] oVirt
Component: ovirt-engine-core
Version: 3.5
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Reporter: Ori Gofen <ogofen>
Assignee: Nir Soffer <nsoffer>
QA Contact: Ori Gofen <ogofen>
CC: acanan, alitke, amureini, bugs, ecohen, fsimonce, gklein, gpadgett, iheim, michal.skrivanek, nsoffer, ogofen, rbalakri, yeylon
Target Milestone: ---
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
oVirt Team: Storage
Fixed In Version: v4.16.2
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-10-17 12:37:31 UTC
Attachments:
- vdsm+engine logs
- Repeating calls to validateVolumePath
- Log without ioprocess logs
- this is the missing log as far as I understood
- vdsm+engine logs

Description Ori Gofen 2014-07-29 15:56:07 UTC
Created attachment 922190 [details]
vdsm+engine logs

Description of problem:
Repeating the create-template-from-VM operation on the same VM causes vdsm to loop again and again through:

Thread-15::DEBUG::2014-07-29 18:46:30,771::__init__::225::IOProcess::(_processLogs) Waiting for next request...


Version-Release number of selected component (if applicable):
beta.2

How reproducible:
100%

Steps to Reproduce:
1. Create a template from a VM
2. Create a template again from the same VM

Actual results:
infinite loop

Expected results:
operation should succeed

Additional info:

Comment 1 Allon Mureinik 2014-07-30 14:25:56 UTC
Nir - is this us or IOProcess itself?

Comment 2 Nir Soffer 2014-07-30 15:05:15 UTC
(In reply to Allon Mureinik from comment #1)
> Nir - is this us or IOProcess itself?

Maybe.

Ori, please disable ioprocess and repeat this test. Can you reproduce it when ioprocess is disabled?

To disable ioprocess, edit vdsm.conf and change:
    
    [irs]
    oop_impl = rfh

And restart vdsm.

Comment 3 Nir Soffer 2014-07-30 16:44:19 UTC
Created attachment 922645 [details]
Repeating calls to validateVolumePath

The bug description does not match the logs; what we actually see is:

Thread-344::INFO::2014-07-29 18:31:38,767::logUtils::44::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='c0259524-af0c-4860-b6fc-58cbd8e97f68', spUUID='00000002-0002-0002-0002-000000000200', vmUUID='', srcImgUUID='8400c2f4-e0c7-4eff-aca2-65448d2cf81c', srcVolUUID='2758889b-2b68-4c64-9e04-a92d84281e7a', dstImgUUID='fdc14974-04a5-41a2-aed1-2a493383f048', dstVolUUID='a1b727dc-e966-441f-b5ec-8f5d69507705', description='Active VM', dstSdUUID='c0259524-af0c-4860-b6fc-58cbd8e97f68', volType=6, volFormat=5, preallocate=2, postZero='false', force='false')

And then we have 19591 repetitions of the line:

Thread-64::DEBUG::<timestamp>::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda

Starting at:

Thread-344::DEBUG::2014-07-29 18:31:38,928::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda


Ending at:

Thread-64::DEBUG::2014-07-29 18:50:07,307::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda

Comment 4 Nir Soffer 2014-07-30 17:03:08 UTC
Correction - there are two instances of this:

1. Copy 1

Thread-344::INFO::2014-07-29 18:31:38,767::logUtils::44::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='c0259524-af0c-4860-b6fc-58cbd8e97f68', spUUID='00000002-0002-0002-0002-000000000200', vmUUID='', srcImgUUID='8400c2f4-e0c7-4eff-aca2-65448d2cf81c', srcVolUUID='2758889b-2b68-4c64-9e04-a92d84281e7a', dstImgUUID='fdc14974-04a5-41a2-aed1-2a493383f048', dstVolUUID='a1b727dc-e966-441f-b5ec-8f5d69507705', description='Active VM', dstSdUUID='c0259524-af0c-4860-b6fc-58cbd8e97f68', volType=6, volFormat=5, preallocate=2, postZero='false', force='false')

Then 8516 lines:

Thread-344::DEBUG::2014-07-29 18:31:38,928::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda
...
Thread-344::DEBUG::2014-07-29 18:36:45,843::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda

2. Copy 2

Thread-64::INFO::2014-07-29 18:42:46,376::logUtils::44::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='c0259524-af0c-4860-b6fc-58cbd8e97f68', spUUID='00000002-0002-0002-0002-000000000200', vmUUID='', srcImgUUID='8400c2f4-e0c7-4eff-aca2-65448d2cf81c', srcVolUUID='2758889b-2b68-4c64-9e04-a92d84281e7a', dstImgUUID='dfe17284-2434-47fe-b934-cb9e21d18cfa', dstVolUUID='b53d645a-aabf-4ab1-8579-097dfd419ba0', description='Active VM', dstSdUUID='c0259524-af0c-4860-b6fc-58cbd8e97f68', volType=6, volFormat=5, preallocate=2, postZero='false', force='false')

Then 11075 lines:

Thread-64::DEBUG::2014-07-29 18:42:46,681::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda
...
Thread-64::DEBUG::2014-07-29 18:50:07,307::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda
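
For reference, a minimal sketch of how these repetition counts can be extracted from the raw log, assuming the vdsm log was saved as vdsm.log (the file name is an assumption, not from this bug):

    from collections import Counter
    import re

    # Match the repeated lines quoted above, e.g.
    # Thread-344::DEBUG::...::Storage.Volume::(validateVolumePath) validate path for <uuid>
    PATTERN = re.compile(
        r"^(Thread-\d+)::DEBUG::.*\(validateVolumePath\) validate path for ([0-9a-f-]+)")

    counts = Counter()
    with open("vdsm.log") as log:  # assumed file name
        for line in log:
            match = PATTERN.match(line)
            if match:
                counts[match.groups()] += 1

    # Per the analysis above, this should report ~8516 lines for Thread-344 and
    # ~11075 lines for Thread-64, all for volume 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda.
    for (thread, volume), count in counts.most_common():
        print(thread, volume, count)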

Comment 5 Nir Soffer 2014-07-30 17:13:22 UTC
Created attachment 922647 [details]
Log without ioprocess logs

This is the original log with the ioprocess log lines removed - this makes it much easier to understand the flow.
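
For anyone repeating the analysis, a minimal sketch of producing such a filtered log, assuming the original log is vdsm.log (the actual attachment may have been produced differently):

    # Drop every line logged under the IOProcess logger, e.g. lines like
    # Thread-15::DEBUG::...::__init__::225::IOProcess::(_processLogs) Waiting for next request...
    with open("vdsm.log") as src, open("vdsm-no-ioprocess.log", "w") as dst:
        for line in src:
            if "::IOProcess::" not in line:
                dst.write(line)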

Comment 6 Nir Soffer 2014-07-30 20:12:18 UTC
Ori, there must be some step missing in your description, since I cannot reproduce this.

1. I created a VM on a data center with 2 NFS storage domains
2. I created 2 snapshots, so the VM disk has 3 volumes (your VM has 4 volumes)
3. I created a template - ok
4. I created another snapshot from the same VM - ok

Comment 7 Nir Soffer 2014-07-30 21:54:13 UTC
Ori, it seems that you have an invalid volume chain.

Can you try the attached patch? It will raise an exception immediately if your volume chain is invalid.
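
The patch itself is not quoted here; the sketch below only illustrates the kind of guard it adds when walking a volume chain (a simplified stand-in for Image.getChain, not the actual vdsm code): track the volumes already visited and fail fast on a cycle instead of looping forever.

    def get_chain(parents, leaf_uuid):
        """Walk from the leaf volume up to the base volume.

        parents maps a volume UUID to its parent UUID (None for the base volume).
        Raises ValueError if the chain contains a cycle.
        """
        chain = []
        seen = set()
        current = leaf_uuid
        while current is not None:
            if current in seen:
                raise ValueError("Corrupted volume chain: cycle at %s" % current)
            seen.add(current)
            chain.append(current)
            current = parents[current]
        return chain

    # A healthy chain terminates at the base volume:
    #   get_chain({"top": "base", "base": None}, "top") -> ["top", "base"]
    # A corrupted chain with a cycle now raises instead of spinning forever:
    #   get_chain({"top": "base", "base": "top"}, "top") -> ValueError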

Comment 8 Allon Mureinik 2014-07-31 04:29:15 UTC
Do we know how the volume chain became invalid?

Comment 9 Federico Simoncelli 2014-07-31 09:16:15 UTC
We should find out what caused the issue to begin with.

The history of image 8400c2f4-e0c7-4eff-aca2-65448d2cf81c is roughly:

- creation:

2014-07-29 14:23:12,519 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (ajp--127.0.0.1-8702-5) [14914841] START, CreateImageVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = 7b7b2e0b-3814-4203-8bf3-bfacda5c604e, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageSizeInBytes = 3221225472, volumeFormat = RAW, newImageId = 175b0f9e-035b-40a5-a48d-67cf755d0eb5, newImageDescription = ), log id: 7216b01c

- first template creation (at first sight irrelevant to bz):

2014-07-29 14:23:51,082 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (ajp--127.0.0.1-8702-2) [2a9f8be4] START, CopyImageVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = 7b7b2e0b-3814-4203-8bf3-bfacda5c604e, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageId = 175b0f9e-035b-40a5-a48d-67cf755d0eb5, dstImageGroupId = f647d292-3422-4cb6-bde7-4a90ddb29415, vmId = f4aa40f2-8b43-427a-bdfd-bf790ebe6ff8, dstImageId = 37724f2f-4e3d-4cd2-b9b4-8bcca7459f35, imageDescription = Active VM, dstStorageDomainId = 7b7b2e0b-3814-4203-8bf3-bfacda5c604e, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), log id: 44fdf11f

- live migration to another storage domain (with live snapshot):

2014-07-29 14:33:24,274 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateVmDisksCommand] (org.ovirt.thread.pool-8-thread-8) [56926870] Running command: LiveMigrateVmDisksCommand Task handler: LiveSnapshotTaskHandler internal: false. Entities affected :  ID: 8400c2f4-e0c7-4eff-aca2-65448d2cf81c Type: Disk,  ID: c0259524-af0c-4860-b6fc-58cbd8e97f68 Type: Storage

2014-07-29 14:33:24,789 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (org.ovirt.thread.pool-8-thread-8) [652ddd43] START, CreateSnapshotVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = 7b7b2e0b-3814-4203-8bf3-bfacda5c604e, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageSizeInBytes = 3221225472, volumeFormat = COW, newImageId = 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda, newImageDescription = , imageId = 175b0f9e-035b-40a5-a48d-67cf755d0eb5, sourceImageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c), log id: 68bee305

2014-07-29 16:24:35,663 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (org.ovirt.thread.pool-8-thread-24) [2f175678] START, CreateSnapshotVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageSizeInBytes = 3221225472, volumeFormat = COW, newImageId = 73c14704-db08-403f-9b32-79f83b1243ab, newImageDescription = , imageId = 9c9f9c84-278e-4711-a9c6-7ebcf50a0dda, sourceImageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c), log id: 31aef3c6

- snapshot creation:

2014-07-29 16:24:48,003 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (org.ovirt.thread.pool-8-thread-28) [32a07c2f] START, CreateSnapshotVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageSizeInBytes = 3221225472, volumeFormat = COW, newImageId = 2758889b-2b68-4c64-9e04-a92d84281e7a, newImageDescription = , imageId = 73c14704-db08-403f-9b32-79f83b1243ab, sourceImageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c), log id: 467a10bd

- cold merge command failure (we should check what was the problem here):

2014-07-29 16:37:43,047 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [154700c3] START, MergeVDSCommand(HostName = vdsb, MergeVDSCommandParameters{HostId = b20b9d2d-d5de-4bef-ab1a-d044df43d90b, vmId=f4aa40f2-8b43-427a-bdfd-bf790ebe6ff8, storagePoolId=00000002-0002-0002-0002-000000000200, storageDomainId=c0259524-af0c-4860-b6fc-58cbd8e97f68, imageGroupId=8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageId=2758889b-2b68-4c64-9e04-a92d84281e7a, baseImageId=9c9f9c84-278e-4711-a9c6-7ebcf50a0dda, topImageId=73c14704-db08-403f-9b32-79f83b1243ab, bandwidth=0}), log id: 6549f990

2014-07-29 16:37:43,203 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [154700c3] Command MergeVDSCommand(HostName = vdsb, MergeVDSCommandParameters{HostId = b20b9d2d-d5de-4bef-ab1a-d044df43d90b, vmId=f4aa40f2-8b43-427a-bdfd-bf790ebe6ff8, storagePoolId=00000002-0002-0002-0002-000000000200, storageDomainId=c0259524-af0c-4860-b6fc-58cbd8e97f68, imageGroupId=8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageId=2758889b-2b68-4c64-9e04-a92d84281e7a, baseImageId=9c9f9c84-278e-4711-a9c6-7ebcf50a0dda, topImageId=73c14704-db08-403f-9b32-79f83b1243ab, bandwidth=0}) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52

- I think these are the last template creations (mentioned in this bug):

2014-07-29 18:31:36,871 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (ajp--127.0.0.1-8702-7) [6c7fcf06] START, CopyImageVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageId = 2758889b-2b68-4c64-9e04-a92d84281e7a, dstImageGroupId = fdc14974-04a5-41a2-aed1-2a493383f048, vmId = f4aa40f2-8b43-427a-bdfd-bf790ebe6ff8, dstImageId = a1b727dc-e966-441f-b5ec-8f5d69507705, imageDescription = Active VM, dstStorageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), log id: 24e4b4d7

2014-07-29 18:42:44,488 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (ajp--127.0.0.1-8702-3) [768d9f5e] START, CopyImageVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageId = 2758889b-2b68-4c64-9e04-a92d84281e7a, dstImageGroupId = dfe17284-2434-47fe-b934-cb9e21d18cfa, vmId = f4aa40f2-8b43-427a-bdfd-bf790ebe6ff8, dstImageId = b53d645a-aabf-4ab1-8579-097dfd419ba0, imageDescription = Active VM, dstStorageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), log id: 6c4e649d

2014-07-29 18:46:07,002 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (ajp--127.0.0.1-8702-2) [6ac01e5f] START, CopyImageVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000200, ignoreFailoverLimit = false, storageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, imageGroupId = 8400c2f4-e0c7-4eff-aca2-65448d2cf81c, imageId = 2758889b-2b68-4c64-9e04-a92d84281e7a, dstImageGroupId = 44ffa380-f059-4942-8dcd-1d3aa0f2974f, vmId = f4aa40f2-8b43-427a-bdfd-bf790ebe6ff8, dstImageId = 3bdaed41-681f-4f4e-9d57-b116e0616110, imageDescription = Active VM, dstStorageDomainId = c0259524-af0c-4860-b6fc-58cbd8e97f68, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), log id: 79d2f636

Comment 10 Nir Soffer 2014-07-31 09:35:40 UTC
This issue seems to be caused by a corrupted volume chain. We don't know yet why the volume chain got corrupted, but this is not an urgent issue.

I'm not sure we have enough info for the image corruption issue, since we don't log the contents of the volume metadata.

Ori, please upload these files:

ls /rhev/data-center/<pool-uuid>/<domain-uuid>/images/8400c2f4-e0c7-4eff-aca2-65448d2cf81c/*.meta

And please do not delete any logs on the machines involved in this bug. The best thing would be to copy all vdsm logs from these machines to some other machine, since we will need them for the investigation.

Comment 11 Nir Soffer 2014-07-31 09:57:30 UTC
Adding back needinfo for Ori.

Comment 12 Federico Simoncelli 2014-07-31 22:07:41 UTC
What we are missing here are the vdsm logs of the failed merge mentioned in comment 9.

In addition, comment 9 describes the image history, which gives us the steps for a reproducer.

I also just noticed that the merge was actually an attempted live merge (please double check).

Adding Greg and Adam to the CC of this bug.

In my opinion a reproducer could be:

- create image
- take snapshot
- take snapshot
- live merge internal snapshot (fails)
- create template (fails with infinite loop)

Adam, you probably need to check if there is something on the vdsm side that could generate this issue.

Comment 13 Adam Litke 2014-08-01 15:29:36 UTC
The attached vdsm logs do not contain evidence of the failed live merge operation.  Please attach logs for a reproduction of the issue that include calling the merge verb.

Comment 14 Ori Gofen 2014-08-03 10:37:47 UTC
Created attachment 923577 [details]
this is the missing log as far as I understood

Comment 15 Ori Gofen 2014-08-03 10:40:21 UTC
(In reply to Nir Soffer from comment #11)
> Adding back needinfo for Ori.

Unfortunately, all images have been cleared from the host. I'll reproduce this one and get back to you.

Comment 16 Adam Litke 2014-08-04 15:36:47 UTC
Looking forward to the reproduction. Looking at your log for the live merge, it looks like the engine's understanding of the volume chain was already wrong:

Thread-567::ERROR::2014-07-29 16:37:45,012::vm::5542::vm.Vm::(merge) vmId=`f4aa40f2-8b43-427a-bdfd-bf790ebe6ff8`::merge: base volume '9c9f9c84-278e-4711-a9c6-7ebcf50a0dda' not found

This happens when the engine specifies a base volume UUID that does not exist in the current volume chain.

Earlier in this same log I see the following error that leads me to believe the volume chain was corrupted much earlier:

1cbf4087-f902-46dd-8598-24f325d61102::ERROR::2014-07-29 15:04:21,585::task::866::Storage.TaskManager.Task::(_setError) Task=`1cbf4087-f902-46dd-8598-24f325d61102`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1796, in createVolume
    desc=desc, srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/sd.py", line 430, in createVolume
    preallocate, diskType, volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 375, in create
    imgPath = image.Image(repoPath).create(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/image.py", line 126, in create
    os.mkdir(imageDir)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/00000002-0002-0002-0002-000000000200/b8ae5843-02e4-405b-a346-c4f1f9aa3494/images/5a939551-5e95-4405-8760-25886d81de20'
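
The OSError above comes from os.mkdir, which creates only the final path component and raises ENOENT when a parent directory is missing; a minimal illustration of that failure mode (not vdsm code, paths made up):

    import errno
    import os
    import tempfile

    root = tempfile.mkdtemp()
    target = os.path.join(root, "images", "5a939551-5e95-4405-8760-25886d81de20")

    try:
        os.mkdir(target)  # fails: the "images" parent directory does not exist
    except OSError as e:
        assert e.errno == errno.ENOENT
        print("os.mkdir failed like the traceback above:", e)

    os.makedirs(target)  # creates the whole path - but in the bug the real question
                         # is why the parent path was missing in the first place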

Comment 17 Ori Gofen 2014-09-09 09:05:28 UTC
Created attachment 935591 [details]
vdsm+engine logs

Reproduced this bug using the exact steps to reproduce:
1. Create a template via the py-sdk
2. Repeat the creation with the same name
3. Try to create a template again using a different name (from the UI or py-sdk)

First of all, a weird error window appears (see image) which says literally nothing to the average user, and the operation fails rather than succeeding.
The infinite loop issue is resolved.
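
A hedged sketch of these steps with the oVirt 3.x Python SDK (ovirtsdk); the engine URL, credentials, VM name and template names are placeholders, not values from this bug:

    from ovirtsdk.api import API
    from ovirtsdk.xml import params

    api = API(url="https://engine.example.com/api",
              username="admin@internal",
              password="password",
              insecure=True)

    vm = api.vms.get(name="test-vm")

    # 1. Create a template from the VM.
    api.templates.add(params.Template(name="tmpl-1", vm=vm))

    # 2. Repeat the creation with the same name - this attempt is expected to fail.
    api.templates.add(params.Template(name="tmpl-1", vm=vm))

    # 3. Create a template again with a different name - the step that previously
    #    drove vdsm into the endless validateVolumePath loop.
    api.templates.add(params.Template(name="tmpl-2", vm=vm))

    api.disconnect()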

Comment 18 Ori Gofen 2014-09-09 09:07:26 UTC
what do you think?

Comment 19 Ori Gofen 2014-09-09 10:03:05 UTC
I have revised my NEEDINFO; I am verifying this bug on RHEV 3.5 and opening new bugs
for the current scenario.

Comment 20 Nir Soffer 2014-09-09 14:06:10 UTC
(In reply to Ori from comment #19)
> I have revised my NEEDINFO; I am verifying this bug on RHEV 3.5 and opening
> new bugs for the current scenario.
Please attach the files you attached here to the other bugs.

It would also be nice to have references to the other bugs here.

Comment 21 Ori Gofen 2014-09-09 15:38:52 UTC
Sure: BZ #1139628, BZ #1139678

Comment 22 Sandro Bonazzola 2014-10-17 12:37:31 UTC
oVirt 3.5 has been released and should include the fix for this issue.