Bug 1781682

Summary: Problems creating snapshots of one specific VM
Product: [oVirt] ovirt-engine
Reporter: francisco.garcia
Component: Backup-Restore.VMs
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED NOTABUG
QA Contact: Avihai <aefrat>
Severity: medium
Docs Contact: bugs <bugs>
Priority: unspecified
Version: 4.3.3.7
CC: bugs, tnisan
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-12-29 08:34:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (flags: none):
 - Partial log of engine when I do a snapshot
 - Partial log of engine when I execute the virtual machine
 - vdsm.log
 - Full engine.log
 - vdsm-tool of iSCSI Storage
 - vdsm-tool NFS Storage
 - update.json
 - get Info volume
 - HostJobs
 - FindOlderLogs.txt
 - vdsm-update-volume log
 - Before/after change of the volume's metadata

Description francisco.garcia 2019-12-10 11:40:41 UTC
Created attachment 1643633 [details]
Partial log of engine when I do a snapshot

Description of problem:
I am not able to take a snapshot of one specific VM. This VM has several disks, allocated on different storage domains and with different formats:
 - 4 GB on NFS storage, THICK (RAW) format.
 - 1 GB on iSCSI storage, THICK (RAW) format.
 - 1 GB on iSCSI storage, THIN (COW) format.
 - 1 GB on NFS storage, THIN (COW) format.

A week ago I could take a snapshot without problems, but now I can't, and I can't explain why. I have been investigating the engine log and it seems there is a problem with the volumes that are created, but I cannot find the exact cause or its solution. The disks' volumes are not broken: I can run this VM without problems and create files inside it.

I have attached the portion of engine.log from when I take a snapshot of this VM, and also the portion from when I start it. Please ask me for any other information you need.


How reproducible:
I don't know how the VM reached this state.


Actual results:
An error occurs when creating a snapshot of this VM.


Expected results:
The snapshot of this VM is created correctly.

Comment 1 francisco.garcia 2019-12-10 11:41:51 UTC
Created attachment 1643634 [details]
Partial log of engine when I execute the virtual machine

Comment 2 RHEL Program Management 2019-12-11 11:56:29 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 3 francisco.garcia 2019-12-11 15:19:27 UTC
Sorry, I set it by mistake; I did not mean that it was a documentation issue. Is it OK now?

Comment 4 Benny Zlotnik 2019-12-11 15:34:02 UTC
Please provide full engine and vdsm logs (SPM and the host where the VM is running)

Comment 5 francisco.garcia 2019-12-11 15:57:05 UTC
Created attachment 1644019 [details]
vdsm.log

vdsm log from the host where the VM is running

Comment 6 francisco.garcia 2019-12-11 15:57:45 UTC
Created attachment 1644020 [details]
Full engine.log

Full engine.log

Comment 7 francisco.garcia 2019-12-11 16:02:43 UTC
I have attached the full vdsm and engine logs. If you need more information, or need me to perform any action on the virtual machine, please ask.

Comment 8 Benny Zlotnik 2019-12-12 09:29:16 UTC
I see:
2019-12-11 16:50:41,941+0100 ERROR (tasks/3) [storage.Image] There is no leaf in the image 71c430cc-17e0-471d-827f-02703c1c318d (image:209)
2019-12-11 16:50:41,941+0100 WARN  (tasks/3) [storage.ResourceManager] Resource factory failed to create resource '01_img_adb25706-a077-443a-b0df-69721cfc9957.71c430cc-17e0-471d-827f-02703c1c318d'. Canceling request. (resourceManager:544)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 540, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 210, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)
ImageIsNotLegalChain: Image is not a legal chain: (u'71c430cc-17e0-471d-827f-02703c1c318d',)

Looks like the chain may have been corrupted previously. Can you share the output of:
`vdsm-tool dump-volum-chains <sd-id>`

Where <sd-id> is the storage domain where the snapshot's disk resides
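
For example, if I'm reading the resource name in the traceback right ('01_img_<sd-id>.<img-id>'), the affected image's domain should be adb25706-a077-443a-b0df-69721cfc9957, so the call would be something like:

  vdsm-tool dump-volume-chains adb25706-a077-443a-b0df-69721cfc9957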

Comment 9 francisco.garcia 2019-12-12 18:30:16 UTC
Thanks for your fast answer :)

I attached two outputs (one for each storage domain that holds the VM's disks).

By the way, the command is actually `vdsm-tool dump-volumE-chains <sd-id>` hehe

Comment 10 francisco.garcia 2019-12-12 18:30:44 UTC
Created attachment 1644535 [details]
vdsm-tool of iSCSI Storage

Comment 11 francisco.garcia 2019-12-12 18:31:15 UTC
Created attachment 1644536 [details]
vdsm-tool NFS Storage

Comment 12 francisco.garcia 2019-12-12 18:32:40 UTC
The image is on the NFS storage domain.

Comment 13 Benny Zlotnik 2019-12-15 09:07:17 UTC
   image:    71c430cc-17e0-471d-827f-02703c1c318d

             - 48b7b1dd-41f6-4259-a996-9038ad253d11
               status: OK, voltype: INTERNAL, format: RAW, legality: LEGAL, type: SPARSE, capacity: 1073741824, truesize: 222797824

Looks like the voltype is INTERNAL while it should be LEAF, since this is the only volume in the chain. This likely indicates a botched snapshot removal.
I don't see any indication of a snapshot removal in the logs though.

If you have older logs, look up errors related to image 71c430cc-17e0-471d-827f-02703c1c318d.
Theoretically, you could edit the volume's voltype using `vdsm-client SDM update_volume` [1], but I am not sure what the state of the volume is, so it's hard to tell whether it would work.
If you choose to do this, use it with caution and make sure you have everything backed up.




[1] https://bugzilla.redhat.com/show_bug.cgi?id=1409380
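
For reference, a minimal update.json for that call might look roughly like the sketch below. This is only a sketch: the exact parameter names (vol_info, vol_attr, "type") and the required generation value are assumptions on my part, so verify them against the SDM schema shipped with your vdsm version before running anything:

  {
      "job_id": "<a-new-random-uuid>",
      "vol_info": {
          "sd_id": "<storage-domain-uuid>",
          "img_id": "71c430cc-17e0-471d-827f-02703c1c318d",
          "vol_id": "48b7b1dd-41f6-4259-a996-9038ad253d11",
          "generation": <current-volume-generation>
      },
      "vol_attr": {
          "type": "LEAF"
      }
  }

It would be passed with `vdsm-client -f update.json SDM update_volume`.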

Comment 14 francisco.garcia 2019-12-16 11:33:05 UTC
Created attachment 1645549 [details]
update.json

JSON to edit voltype

Comment 15 francisco.garcia 2019-12-16 11:33:38 UTC
Created attachment 1645550 [details]
get Info volume

Comment 16 francisco.garcia 2019-12-16 11:34:15 UTC
Created attachment 1645551 [details]
HostJobs

Comment 17 francisco.garcia 2019-12-16 11:34:55 UTC
Created attachment 1645552 [details]
FindOlderLogs.txt

Comment 18 francisco.garcia 2019-12-16 11:35:57 UTC
Created attachment 1645553 [details]
vdsm-update-volume log

Comment 19 francisco.garcia 2019-12-16 11:41:43 UTC
Thanks for your help and your time.

First of all, I checked the older logs, but I didn't find any error messages related to image 71c430cc-17e0-471d-827f-02703c1c318d (see attachment FindOlderLogs.txt).

I found several warning messages in this format:
engine.log-20191121:
    2019-11-20 16:12:41,235+01 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [8fea739a-f977-4281-9ebd-e97fe646d4cf] Trying to release a shared lock for key: '71c430cc-17e0-471d-827f-02703c1c318dDISK' , but lock does not exist


On the other hand, I tried to edit the volume's voltype with the method from [1], but it raised an exception:
2019-12-16 11:35:50,681+0100 ERROR (tasks/9) [storage.Image] There is no leaf in the image 71c430cc-17e0-471d-827f-02703c1c318d (image:209)
2019-12-16 11:35:50,681+0100 WARN  (tasks/9) [storage.ResourceManager] Resource factory failed to create resource '01_img_adb25706-a077-443a-b0df-69721cfc9957.71c430cc-17e0-471d-827f-02703c1c318d'. Canceling request. (resourceManager:544)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 540, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 210, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)



I have attached several files:
 - update.json -> the file used to edit the volume.
 - vdsm-update-volume log -> vdsm log from when I ran: vdsm-client -f update.json SDM update_volume
 - HostJobs -> jobs on the host where the image is allocated.
 - get Info volume -> info of the volume with image 71c430cc-17e0-471d-827f-02703c1c318d

Any idea about this error?

Comment 20 Benny Zlotnik 2019-12-16 13:48:59 UTC
If that's the case, I can only suggest editing the volume's metadata directly; it should be the 48b7b1dd-41f6-4259-a996-9038ad253d11.meta file. But again, make sure you have backups.
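
For illustration only: on a file-based (NFS) domain the volume metadata is a plain key=value file stored next to the volume, and the relevant change here would be the VOLTYPE line. The key names below are assumed from the dump-volume-chains output above, so double-check them against your actual .meta file and keep a copy of the original before editing:

  Before (excerpt of 48b7b1dd-41f6-4259-a996-9038ad253d11.meta):
      FORMAT=RAW
      TYPE=SPARSE
      LEGALITY=LEGAL
      VOLTYPE=INTERNAL

  After:
      FORMAT=RAW
      TYPE=SPARSE
      LEGALITY=LEGAL
      VOLTYPE=LEAF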

Comment 21 francisco.garcia 2019-12-28 20:23:14 UTC
Thanks for your help and your time, and sorry for the delay.

I edited the volume's metadata directly, and... it works! =D

I have attached the before/after of the volume's metadata.

Once more, thank you for your help =) although I don't fully understand the change I have made.

Comment 22 francisco.garcia 2019-12-28 20:23:58 UTC
Created attachment 1648256 [details]
Before/after change of the volume's metadata

Comment 23 Benny Zlotnik 2019-12-29 08:34:29 UTC
(In reply to francisco.garcia from comment #21)
> Thanks for your help and your time, and sorry for the delay.
> 
> I edited the volume's metadata directly, and... it works! =D
> 
> I have attached the before/after of the volume's metadata.
> 
> Once more, thank you for your help =) although I don't fully understand the
> change I have made.

Glad to hear it worked!

The simple explanation is that an attempted snapshot removal (which we have no logs of) corrupted the chain.
The chain consists of a single volume, but in its metadata its type is INTERNAL, which would mean a volume with both a parent and a child; that is not correct in practice, since it only has a parent. The last volume in a chain must have type LEAF, so vdsm blocked the operation because the chain didn't make sense, and you had to fix the metadata manually.

Since the issue is resolved, I will be closing this bug.