Bug 1461029 - Live merge fails when file based domain includes "images" in its path
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: Core
4.19.15
x86_64 Linux
Priority: high, Severity: urgent
: ovirt-4.1.5
: 4.19.25
Assigned To: Ala Hino
Lilach Zitnitski
Depends On:
Blocks:
 
Reported: 2017-06-13 08:06 EDT by richardfalzini
Modified: 2017-08-23 04:03 EDT (History)
7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The storage server name or domain path contains "images". Consequence: When performing two consecutive live merge operations on the same VM, the second one fails. Fix: Fix the logic that extracts the domain path so that it expects "images" in the image (disk) path. Result: Live merge will not fail.
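The failure mode in the Doc Text can be sketched as follows. This is a minimal illustration, not the actual vdsm patch: the helper names and the exact mount path layout are assumptions, built around the Gluster volume name "vm-images-repo-demo" from this report, which itself contains the substring "images".

```python
# Hypothetical reproduction of the bug described above: a volume path on
# a Gluster storage domain whose volume name "vm-images-repo-demo"
# contains the substring "images".
VOL_PATH = ("/rhev/data-center/mnt/glusterSD/compute-0-0:_vm-images-repo-demo/"
            "73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/"
            "09e08d20-6317-4418-9b69-9e5f396b64f9/"
            "81b15267-721d-45ec-b9e2-061663febe5a")

def dom_path_buggy(vol_path):
    # Splits on the FIRST occurrence of "images", which here lands
    # inside the Gluster volume name and truncates the domain path.
    return vol_path.split('images')[0]

def dom_path_fixed(vol_path):
    # Splits on the LAST "/images/" path component, which is always the
    # images directory directly under the storage domain.
    return vol_path.rsplit('/images/', 1)[0]

print(dom_path_buggy(VOL_PATH))  # truncated inside "vm-images-repo-demo"
print(dom_path_fixed(VOL_PATH))  # ends at the storage domain UUID
```

With a domain path free of the substring "images" both helpers agree, which is why the bug only bites setups like this one.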
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-23 04:03:31 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.1+


Attachments
tar that collect all the log (981.30 KB, application/x-bzip)
2017-06-13 08:06 EDT, richardfalzini
log after second attempt to delete snapshot (1.65 MB, application/x-bzip)
2017-06-15 09:35 EDT, richardfalzini
change logger to DEBUG restarted vdsm and delete again s1 (1.47 MB, application/x-gzip)
2017-06-19 09:32 EDT, richardfalzini
new test from begin with log level set to DEBUG (5.04 MB, application/x-gzip)
2017-06-20 10:04 EDT, richardfalzini
gluster log (283.78 KB, application/x-gzip)
2017-06-21 10:22 EDT, richardfalzini
test with oVirt 4.1.3 (3.39 MB, application/x-gzip)
2017-07-06 12:17 EDT, richardfalzini
test within SPM (3.39 MB, application/x-gzip)
2017-07-17 08:06 EDT, richardfalzini
test13_InsideSPM (3.85 MB, application/x-gzip)
2017-07-19 12:01 EDT, richardfalzini
wf with patch 79622 (4.06 MB, application/x-gzip)
2017-07-20 09:57 EDT, richardfalzini
test 15 patch 2 (8.30 MB, application/x-gzip)
2017-07-20 13:03 EDT, richardfalzini
rpm -qa (24.67 KB, text/plain)
2017-07-20 13:06 EDT, richardfalzini


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 79657 master MERGED fileVolume: Fix volume path spliting when domain or server path contains "images" 2017-07-25 04:20 EDT
oVirt gerrit 79790 ovirt-4.1 MERGED fileVolume: Fix volume path spliting when domain or server path contains "images" 2017-07-27 07:47 EDT

Description richardfalzini 2017-06-13 08:06:06 EDT
Created attachment 1287243 [details]
tar that collect all the log

Description of problem:
Impossible to delete a snapshot.

Version-Release number of selected component (if applicable):
vdsm 4.19.15
ovirt-engine 4.1.2.2
cluster compatibility 4.1
glusterfs 3.8.12

How reproducible:
Create a VM with 4 snapshots.

Steps to Reproduce:
1. Delete one snapshot.
2. Delete one more.
3. Wait until the engine realizes that the merge has failed.

Actual results:
The merge fails.

Expected results:
The merge completes without errors.

Additional info:

in vdsm.log:

I found two strange facts:
(1) The line below should report the list of images in the chain, but it is not reporting it:
2017-06-12 14:36:24,070+0200 INFO  (merge/3973ee62) [storage.Image] sdUUID=73ca2906-e59d-4c13-97f4-f636cad9fb0e imgUUID=09e08d20-6317-4418-9b69-9e5f396b64f9 chain=[<storage.glusterVolume.GlusterVolume object at 0x1bd37d0>, <storage.glusterVolume.GlusterVolume object at 0x24b3810>]  (image:285)


(2) The message at the end of the traceback refers to an image that was deleted in the first snapshot removal:
VolumeDoesNotExist: Volume does not exist: ('94f463fa-f431-4a6a-b1f5-11b234161de3',)

The chain states are saved in chain-before-delete1.txt, after-delete-1.txt and after-delete2.txt in log.tar.

Could it be related to vdsm doing a blockcommit while some module expects a blockpull?

In log.tar:

-engine.log
-vdsm.log
-sanlock.log
-libvirt.log
-glusterMountPoint.log
-chain-before-delete1.txt
-after-delete-1.txt
-after-delete2.txt
Comment 1 richardfalzini 2017-06-13 08:10:43 EDT
Useful additional info:

vmName=mergeWftest4
vmId=3973ee62-98ee-4732-8b63-c63adffd855d
disk_group_id=09e08d20-6317-4418-9b69-9e5f396b64f9
mountpoint path: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9
snapshot_disk_id:
 current = c54dd527-06ac-4790-bcb8-8c3881b05442
 s4 = dd3485a6-6843-4664-bbbc-b6db9f68eb58
 s3 = 94f463fa-f431-4a6a-b1f5-11b234161de3
 s2 = bd74e5dc-95fd-40e3-aeb0-5fef6b67d172
 s1 = 81b15267-721d-45ec-b9e2-061663febe5a

delete s2:

correlationId: c303cb38-bd04-4a44-a8c2-15f8608d4047

Jun 12, 2017 2:23:42 PM Snapshot 's2' deletion for VM 'mergeWftest4' was initiated by admin@internal-authz.	

Jun 12, 2017 2:25:14 PM Snapshot 's2' deletion for VM 'mergeWftest4' has been completed.
	
new situation:

snapshot_disk_id:
 current = c54dd527-06ac-4790-bcb8-8c3881b05442
 s4 = dd3485a6-6843-4664-bbbc-b6db9f68eb58
 s3 = bd74e5dc-95fd-40e3-aeb0-5fef6b67d172
 s1 = 81b15267-721d-45ec-b9e2-061663febe5a

Start deleting s1:

correlationID: a9a0b2f0-ac6b-47f7-affa-590e119fd212

Jun 12, 2017 2:35:29 PM Snapshot 's1' deletion for VM 'mergeWftest4' was initiated by admin@internal-authz.
Comment 2 Ala Hino 2017-06-15 08:01:51 EDT
In the log I see an error indicating that the storage wasn't available.
Can you please try deleting the same snapshot again and update us with the results?
Comment 3 richardfalzini 2017-06-15 09:35 EDT
Created attachment 1288065 [details]
log after second attempt to delete snapshot

As requested, here is the log of the new delete attempt.
Comment 4 Ala Hino 2017-06-19 04:51:44 EDT
I am looking at the chains in the attached files.

Original chain (file: chain-before-delete1.txt):

c54dd527-06ac-4790-bcb8-8c3881b05442 -> 
dd3485a6-6843-4664-bbbc-b6db9f68eb58 -> 
94f463fa-f431-4a6a-b1f5-11b234161de3 -> 
bd74e5dc-95fd-40e3-aeb0-5fef6b67d172 -> 
81b15267-721d-45ec-b9e2-061663febe5a

After deleting s3 (94f463fa-f431-4a6a-b1f5-11b234161de3), I see the expected chain (file: after-delete-1.txt):

c54dd527-06ac-4790-bcb8-8c3881b05442 ->
dd3485a6-6843-4664-bbbc-b6db9f68eb58 ->
bd74e5dc-95fd-40e3-aeb0-5fef6b67d172 ->
81b15267-721d-45ec-b9e2-061663febe5a

However, after deleting s2 (bd74e5dc-95fd-40e3-aeb0-5fef6b67d172), I see two chains (file: after-delete2.txt):

chain-1:
c54dd527-06ac-4790-bcb8-8c3881b05442 ->
dd3485a6-6843-4664-bbbc-b6db9f68eb58 ->
81b15267-721d-45ec-b9e2-061663febe5a

chain-2:
bd74e5dc-95fd-40e3-aeb0-5fef6b67d172 ->
81b15267-721d-45ec-b9e2-061663febe5a

chain-1 is what I expected to see. Trying to understand how we got chain-2.

Can you please send the current chain?

In addition, could you please send the volume info by running the following command on each volume (also on 94f463fa-f431-4a6a-b1f5-11b234161de3):

vdsm-client Volume getInfo storagepoolID=<spUUID> storagedomainID=<sdUUID> imageID=<imgUUID> volumeID=<volUUID>
Comment 5 Ala Hino 2017-06-19 05:48:45 EDT
One more request: could you please set the log level for storage and virt to DEBUG?

On the host, edit /etc/vdsm/logger.conf and change log level of logger_storage and logger_virt to DEBUG.

Thanks.
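For reference, the change amounts to editing the two logger sections to something like the following. This is an illustrative excerpt, not the full file; the handler name on your host may differ.

```ini
[logger_storage]
level=DEBUG
handlers=logfile
qualname=storage
propagate=0

[logger_virt]
level=DEBUG
handlers=logfile
qualname=virt
propagate=0
```

Restart vdsmd afterwards so the new levels take effect.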
Comment 6 Ala Hino 2017-06-19 05:49:40 EDT
(In reply to Ala Hino from comment #5)
> One more request please, could you please set log level for storage and virt
> to DEBUG?
> 
> On the host, edit /etc/vdsm/logger.conf and change log level of
> logger_storage and logger_virt to DEBUG.
> 
> Thanks.

And of course try deleting the snapshot again and upload the logs.
Comment 7 richardfalzini 2017-06-19 08:11:29 EDT
(In reply to Ala Hino from comment #4)
> I am looking at the chains in the attached files.
> 
> Original chain (file: chain-before-delete1.txt):
> 
> c54dd527-06ac-4790-bcb8-8c3881b05442 -> 
> dd3485a6-6843-4664-bbbc-b6db9f68eb58 -> 
> 94f463fa-f431-4a6a-b1f5-11b234161de3 -> 
> bd74e5dc-95fd-40e3-aeb0-5fef6b67d172 -> 
> 81b15267-721d-45ec-b9e2-061663febe5a
> 
> After deleting s3 (94f463fa-f431-4a6a-b1f5-11b234161de3), I see the expected
> chain (file: after-delete-1.txt):
> 
> c54dd527-06ac-4790-bcb8-8c3881b05442 ->
> dd3485a6-6843-4664-bbbc-b6db9f68eb58 ->
> bd74e5dc-95fd-40e3-aeb0-5fef6b67d172 ->
> 81b15267-721d-45ec-b9e2-061663febe5a
> 
> However, after deleting s2 (bd74e5dc-95fd-40e3-aeb0-5fef6b67d172), I see two
> chains (file: after-delete2.txt):
> 
> chain-1:
> c54dd527-06ac-4790-bcb8-8c3881b05442 ->
> dd3485a6-6843-4664-bbbc-b6db9f68eb58 ->
> 81b15267-721d-45ec-b9e2-061663febe5a
> 
> chain-2:
> bd74e5dc-95fd-40e3-aeb0-5fef6b67d172 ->
> 81b15267-721d-45ec-b9e2-061663febe5a
> 
> chain-1 is what I expected to see. Trying to understand how we got chain-2.
> 
> Can you please send the current chain?

image: c54dd527-06ac-4790-bcb8-8c3881b05442
file format: qcow2
virtual size: 15G (16106127360 bytes)
disk size: 4.9G
cluster_size: 65536
backing file: dd3485a6-6843-4664-bbbc-b6db9f68eb58
backing file format: qcow2
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

file format: qcow2
virtual size: 15G (16106127360 bytes)
disk size: 163M
cluster_size: 65536
backing file: 81b15267-721d-45ec-b9e2-061663febe5a
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

file format: qcow2
virtual size: 15G (16106127360 bytes)
disk size: 1.3G
cluster_size: 65536
backing file: 81b15267-721d-45ec-b9e2-061663febe5a
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

image: 81b15267-721d-45ec-b9e2-061663febe5a
file format: raw
virtual size: 15G (16106127360 bytes)
disk size: 6.0G

> 
> In addition, could you please send the volumes info by running following
> command on each volume (also on 94f463fa-f431-4a6a-b1f5-11b234161de3):
> 
The volume 94f463fa-f431-4a6a-b1f5-11b234161de3 has not existed since the deletion of s3.
ls of the folder /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9:


-rw-rw----  1 vdsm kvm  15G Jun 12 14:36 81b15267-721d-45ec-b9e2-061663febe5a
-rw-rw----  1 vdsm kvm 1.0M Jun  9 11:53 81b15267-721d-45ec-b9e2-061663febe5a.lease
-rw-r--r--  1 vdsm kvm  342 Jun 19 14:08 81b15267-721d-45ec-b9e2-061663febe5a.meta
-rw-rw----  1 vdsm kvm 1.3G Jun 12 14:24 bd74e5dc-95fd-40e3-aeb0-5fef6b67d172
-rw-rw----  1 vdsm kvm 1.0M Jun  9 13:26 bd74e5dc-95fd-40e3-aeb0-5fef6b67d172.lease
-rw-r--r--  1 vdsm kvm  265 Jun 12 14:24 bd74e5dc-95fd-40e3-aeb0-5fef6b67d172.meta
-rw-rw----  1 vdsm kvm 4.9G Jun 19 14:08 c54dd527-06ac-4790-bcb8-8c3881b05442
-rw-rw----  1 vdsm kvm 1.0M Jun  9 14:31 c54dd527-06ac-4790-bcb8-8c3881b05442.lease
-rw-r--r--  1 vdsm kvm  265 Jun  9 14:31 c54dd527-06ac-4790-bcb8-8c3881b05442.meta
-rw-rw----  1 vdsm kvm 163M Jun 12 14:36 dd3485a6-6843-4664-bbbc-b6db9f68eb58
-rw-rw----  1 vdsm kvm 1.0M Jun  9 14:25 dd3485a6-6843-4664-bbbc-b6db9f68eb58.lease
-rw-r--r--  1 vdsm kvm  269 Jun  9 14:31 dd3485a6-6843-4664-bbbc-b6db9f68eb58.meta


> vdsm-client Volume getInfo storagepoolID=<spUUID> storagedomainID=<sdUUID>
> imageID=<imgUUID> volumeID=<volUUID>

{
    "status": "OK", 
    "lease": {
        "owners": [], 
        "version": null
    }, 
    "domain": "73ca2906-e59d-4c13-97f4-f636cad9fb0e", 
    "capacity": "16106127360", 
    "voltype": "INTERNAL", 
    "description": "{\"DiskAlias\":\"mergeWftest4_Disk1\",\"DiskDescription\":\"mergeWftest4_Disk1\"}", 
    "parent": "00000000-0000-0000-0000-000000000000", 
    "format": "RAW", 
    "generation": 0, 
    "image": "09e08d20-6317-4418-9b69-9e5f396b64f9", 
    "uuid": "81b15267-721d-45ec-b9e2-061663febe5a", 
    "disktype": "2", 
    "legality": "LEGAL", 
    "mtime": "0", 
    "apparentsize": "16106127360", 
    "truesize": "6432301056", 
    "type": "SPARSE", 
    "children": [], 
    "pool": "", 
    "ctime": "1497001994"
}
{
    "status": "OK", 
    "lease": {
        "owners": [], 
        "version": null
    }, 
    "domain": "73ca2906-e59d-4c13-97f4-f636cad9fb0e", 
    "capacity": "16106127360", 
    "voltype": "LEAF", 
    "description": "", 
    "parent": "81b15267-721d-45ec-b9e2-061663febe5a", 
    "format": "COW", 
    "generation": 0, 
    "image": "09e08d20-6317-4418-9b69-9e5f396b64f9", 
    "uuid": "bd74e5dc-95fd-40e3-aeb0-5fef6b67d172", 
    "disktype": "2", 
    "legality": "LEGAL", 
    "mtime": "0", 
    "apparentsize": "1356070912", 
    "truesize": "1365323776", 
    "type": "SPARSE", 
    "children": [], 
    "pool": "", 
    "ctime": "1497007610"
}
{
    "status": "OK", 
    "lease": {
        "owners": [], 
        "version": null
    }, 
    "domain": "73ca2906-e59d-4c13-97f4-f636cad9fb0e", 
    "capacity": "16106127360", 
    "voltype": "LEAF", 
    "description": "", 
    "parent": "dd3485a6-6843-4664-bbbc-b6db9f68eb58", 
    "format": "COW", 
    "generation": 0, 
    "image": "09e08d20-6317-4418-9b69-9e5f396b64f9", 
    "uuid": "c54dd527-06ac-4790-bcb8-8c3881b05442", 
    "disktype": "2", 
    "legality": "LEGAL", 
    "mtime": "0", 
    "apparentsize": "5252448256", 
    "truesize": "5286686720", 
    "type": "SPARSE", 
    "children": [], 
    "pool": "", 
    "ctime": "1497011478"
}
{
    "status": "OK", 
    "lease": {
        "owners": [], 
        "version": null
    }, 
    "domain": "73ca2906-e59d-4c13-97f4-f636cad9fb0e", 
    "capacity": "16106127360", 
    "voltype": "INTERNAL", 
    "description": "", 
    "parent": "94f463fa-f431-4a6a-b1f5-11b234161de3", 
    "format": "COW", 
    "generation": 0, 
    "image": "09e08d20-6317-4418-9b69-9e5f396b64f9", 
    "uuid": "dd3485a6-6843-4664-bbbc-b6db9f68eb58", 
    "disktype": "2", 
    "legality": "LEGAL", 
    "mtime": "0", 
    "apparentsize": "170786816", 
    "truesize": "170655744", 
    "type": "SPARSE", 
    "children": [], 
    "pool": "", 
    "ctime": "1497011131"
}
Comment 8 Ala Hino 2017-06-19 09:20:37 EDT
Yeah, it is expected that 94f463fa-f431-4a6a-b1f5-11b234161de3 does not exist. The thing is that the Vdsm metadata isn't updated. See the volume info of dd3485a6-6843-4664-bbbc-b6db9f68eb58: its parent is 94f463fa-f431-4a6a-b1f5-11b234161de3 although it doesn't exist. This, btw, explains the failure (VolumeDoesNotExist: Volume does not exist: ('94f463fa-f431-4a6a-b1f5-11b234161de3',)) you get each time you try to delete the snapshot.

Somehow, and this is what I am trying to figure out, the Vdsm metadata isn't consistent with the storage: after the first delete, the parent of dd3485a6-6843-4664-bbbc-b6db9f68eb58 should be bd74e5dc-95fd-40e3-aeb0-5fef6b67d172, not 94f463fa-f431-4a6a-b1f5-11b234161de3.
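The inconsistency can be checked mechanically by comparing each volume's parent, as recorded in the Vdsm metadata, against the set of volumes actually present on storage. A sketch using the UUIDs from the getInfo output in comment 7 (this is not a vdsm API, just illustrative code):

```python
# Parents as recorded in Vdsm volume metadata (from the getInfo output
# in comment 7). The parent of dd3485a6 is stale: it still points at
# the volume removed by the first live merge.
metadata_parent = {
    "c54dd527-06ac-4790-bcb8-8c3881b05442": "dd3485a6-6843-4664-bbbc-b6db9f68eb58",
    "dd3485a6-6843-4664-bbbc-b6db9f68eb58": "94f463fa-f431-4a6a-b1f5-11b234161de3",
    "bd74e5dc-95fd-40e3-aeb0-5fef6b67d172": "81b15267-721d-45ec-b9e2-061663febe5a",
    "81b15267-721d-45ec-b9e2-061663febe5a": "00000000-0000-0000-0000-000000000000",
}

BASE = "00000000-0000-0000-0000-000000000000"  # parent of a raw base volume

def stale_parents(parents):
    """Return (volume, parent) pairs whose parent volume no longer exists."""
    present = set(parents)
    return [(vol, p) for vol, p in parents.items()
            if p != BASE and p not in present]

print(stale_parents(metadata_parent))
```

Run against this report's data, the only stale link is dd3485a6 pointing at the deleted 94f463fa, which matches the VolumeDoesNotExist error above.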

Will you be able to try deleting the snapshot again while DEBUG level is turned on for storage and virt components?
Comment 9 richardfalzini 2017-06-19 09:32 EDT
Created attachment 1289106 [details]
change logger to DEBUG restarted vdsm and delete again s1

(In reply to Ala Hino from comment #6)
> (In reply to Ala Hino from comment #5)
> > One more request please, could you please set log level for storage and virt
> > to DEBUG?
> > 
> > On the host, edit /etc/vdsm/logger.conf and change log level of
> > logger_storage and logger_virt to DEBUG.
> > 
> > Thanks.
> 
> And of course try deleting the snapshot again and upload the logs.
Comment 10 richardfalzini 2017-06-20 10:04 EDT
Created attachment 1289653 [details]
new test from begin with log level set to DEBUG

To help you with the debugging, I made a new test from scratch with a new VM and the log level set to DEBUG.

In case you need info on the Gluster configuration and status, please let me know.


vmName=mergeWftest5
vmId=80d20eee-4a95-40f4-bdc6-c9b3581ba3dd
disk_group_id=52b5d0ba-a458-4b7f-b18c-82f990af52aa
mountpoint path: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/52b5d0ba-a458-4b7f-b18c-82f990af52aa

snapshot_disk_id:
 current = 6c9d1f32-fa4f-4d77-97ca-ef41604569c4 
 s4 = 423847b4-2a6f-428d-8e95-73cd5f0177dc
 s3 = e0d4ba92-7c18-42c8-9336-2d7a63470178
 s2 = fad94d5e-7db2-458e-94a3-921c3db937c9
 s1 = b004c421-47c4-425b-bb0f-d21bab267c6b

check the chain qemu-img info:
file qemu-info-chain1.txt

vdsm-client Volume getInfo:
file volumeInfo1.txt 


Now start deleting s1:

correlationId:	ee29a99a-8a5f-4448-858d-dbcced68388c

Jun 20, 2017 12:08:56 PM Snapshot 's1' deletion for VM 'mergeWftest5' was initiated by admin@internal-authz.

Jun 20, 2017 12:11:30 PM Snapshot 's1' deletion for VM 'mergeWftest5' has been completed.

new situation:

snapshot_disk_id:
 current = 6c9d1f32-fa4f-4d77-97ca-ef41604569c4
 s4 = 423847b4-2a6f-428d-8e95-73cd5f0177dc
 s3 = e0d4ba92-7c18-42c8-9336-2d7a63470178
 s2 = b004c421-47c4-425b-bb0f-d21bab267c6b



check the chain qemu-img info:
file qemu-info-chain2.txt

vdsm-client Volume getInfo:
file volumeInfo2.txt 


update ovs_storage

Now start deleting s2:

correlationID: 5f1e2cde-21d4-4285-9c0a-51a5f93b9ab0

Jun 20, 2017 12:19:46 PM Snapshot 's2' deletion for VM 'mergeWftest5' was initiated by admin@internal-authz.


check the chain qemu-img info:
file qemu-info-chain3.txt

vdsm-client Volume getInfo:
file volumeInfo3.txt
Comment 11 Ala Hino 2017-06-20 10:21:00 EDT
Much appreciated. Thanks.

Please do upload the Gluster configuration.
Comment 12 richardfalzini 2017-06-20 10:44:36 EDT
Hope this is all the info you need; if you need more, let me know and I'll give it to you ASAP.

Thanks for the help.

Gluster info:
#################################
#################################
#gluster --version

glusterfs 3.8.12 built on May 11 2017 18:46:22

#################################
#################################
#gluster volume status vm-images-repo-demo

Status of volume: vm-images-repo-demo
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick compute-0-0:/data/glusterfs/brick2/vm
-images-repo-demo                           49162     0          Y       2139 
Brick compute-0-1:/data/glusterfs/brick2/vm
-images-repo-demo                           49158     0          Y       2462 
Brick compute-0-2:/data/glusterfs/brick2/vm
-images-repo-demo                           49158     0          Y       2896 
Brick compute-0-0:/data/glusterfs/brick3/vm
-images-repo-demo                           49163     0          Y       2156 
Brick compute-0-1:/data/glusterfs/brick3/vm
-images-repo-demo                           49159     0          Y       2457 
Brick compute-0-2:/data/glusterfs/brick3/vm
-images-repo-demo                           49159     0          Y       2902 
Brick compute-0-0:/data/glusterfs/brick4/vm
-images-repo-demo                           49165     0          Y       2254 
Brick compute-0-3:/data/glusterfs/brick2/vm
-images-repo-demo                           49152     0          Y       3868 
Brick compute-0-4:/data/glusterfs/brick2/vm
-images-repo-demo                           49152     0          Y       2706 
Brick compute-0-1:/data/glusterfs/brick4/vm
-images-repo-demo                           49161     0          Y       2446 
Brick compute-0-3:/data/glusterfs/brick3/vm
-images-repo-demo                           49153     0          Y       3862 
Brick compute-0-4:/data/glusterfs/brick3/vm
-images-repo-demo                           49153     0          Y       2687 
Brick compute-0-2:/data/glusterfs/brick4/vm
-images-repo-demo                           49161     0          Y       2910 
Brick compute-0-3:/data/glusterfs/brick4/vm
-images-repo-demo                           49154     0          Y       3855 
Brick compute-0-4:/data/glusterfs/brick4/vm
-images-repo-demo                           49154     0          Y       2699 
Self-heal Daemon on localhost               N/A       N/A        Y       13984
Self-heal Daemon on compute-0-2             N/A       N/A        Y       8653 
Self-heal Daemon on compute-0-0             N/A       N/A        Y       11348
Self-heal Daemon on compute-0-3             N/A       N/A        Y       6701 
Self-heal Daemon on compute-0-4             N/A       N/A        Y       12306
 
Task Status of Volume vm-images-repo-demo
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : ac0c2ff9-927e-40d7-bc68-f8fa54de6049
Status               : completed           
 
#############################################################
#############################################################
##gluster peer status

Number of Peers: 4

Hostname: compute-0-3
Uuid: ef2132f5-f4c8-4f6d-ab06-62a3e025372a
State: Peer in Cluster (Connected)
Other names:
192.168.3.104

Hostname: compute-0-2
Uuid: fb11cda4-e716-4e30-85c5-9cbf73466b01
State: Peer in Cluster (Connected)
Other names:
192.168.3.103

Hostname: compute-0-4
Uuid: fc02d6bb-96b3-4d63-89d1-a277a5ef26db
State: Peer in Cluster (Connected)
Other names:
192.168.3.105

Hostname: compute-0-0
Uuid: d286dbdf-f6e9-4c9a-af92-82d71a5f2d51
State: Peer in Cluster (Connected)
Other names:
192.168.3.101


################################
###############################
#gluster volume get vm-images-repo-demo all


Option                                  Value                                   
------                                  -----                                   
cluster.lookup-unhashed                 on                                      
cluster.lookup-optimize                 off                                     
cluster.min-free-disk                   10%                                     
cluster.min-free-inodes                 5%                                      
cluster.rebalance-stats                 off                                     
cluster.subvols-per-directory           (null)                                  
cluster.readdir-optimize                off                                     
cluster.rsync-hash-regex                (null)                                  
cluster.extra-hash-regex                (null)                                  
cluster.dht-xattr-name                  trusted.glusterfs.dht                   
cluster.randomize-hash-range-by-gfid    off                                     
cluster.rebal-throttle                  normal                                  
cluster.lock-migration                  off                                     
cluster.local-volume-name               (null)                                  
cluster.weighted-rebalance              on                                      
cluster.switch-pattern                  (null)                                  
cluster.entry-change-log                on                                      
cluster.read-subvolume                  (null)                                  
cluster.read-subvolume-index            -1                                      
cluster.read-hash-mode                  1                                       
cluster.background-self-heal-count      8                                       
cluster.metadata-self-heal              on                                      
cluster.data-self-heal                  on                                      
cluster.entry-self-heal                 on                                      
cluster.self-heal-daemon                on                                      
cluster.heal-timeout                    600                                     
cluster.self-heal-window-size           1                                       
cluster.data-change-log                 on                                      
cluster.metadata-change-log             on                                      
cluster.data-self-heal-algorithm        full                                    
cluster.eager-lock                      enable                                  
disperse.eager-lock                     on                                      
cluster.quorum-type                     auto                                    
cluster.quorum-count                    (null)                                  
cluster.choose-local                    true                                    
cluster.self-heal-readdir-size          1KB                                     
cluster.post-op-delay-secs              1                                       
cluster.ensure-durability               on                                      
cluster.consistent-metadata             no                                      
cluster.heal-wait-queue-length          128                                     
cluster.favorite-child-policy           none                                    
cluster.stripe-block-size               128KB                                   
cluster.stripe-coalesce                 true                                    
diagnostics.latency-measurement         off                                     
diagnostics.dump-fd-stats               off                                     
diagnostics.count-fop-hits              off                                     
diagnostics.brick-log-level             INFO                                    
diagnostics.client-log-level            INFO                                    
diagnostics.brick-sys-log-level         CRITICAL                                
diagnostics.client-sys-log-level        CRITICAL                                
diagnostics.brick-logger                (null)                                  
diagnostics.client-logger               (null)                                  
diagnostics.brick-log-format            (null)                                  
diagnostics.client-log-format           (null)                                  
diagnostics.brick-log-buf-size          5                                       
diagnostics.client-log-buf-size         5                                       
diagnostics.brick-log-flush-timeout     120                                     
diagnostics.client-log-flush-timeout    120                                     
diagnostics.stats-dump-interval         0                                       
diagnostics.fop-sample-interval         0                                       
diagnostics.fop-sample-buf-size         65535                                   
diagnostics.stats-dnscache-ttl-sec      86400                                   
performance.cache-max-file-size         0                                       
performance.cache-min-file-size         0                                       
performance.cache-refresh-timeout       1                                       
performance.cache-priority                                                      
performance.cache-size                  32MB                                    
performance.io-thread-count             16                                      
performance.high-prio-threads           16                                      
performance.normal-prio-threads         16                                      
performance.low-prio-threads            32                                      
performance.least-prio-threads          1                                       
performance.enable-least-priority       on                                      
performance.least-rate-limit            0                                       
performance.cache-size                  128MB                                   
performance.flush-behind                on                                      
performance.nfs.flush-behind            on                                      
performance.write-behind-window-size    1MB                                     
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct             off                                     
performance.nfs.strict-o-direct         off                                     
performance.strict-write-ordering       off                                     
performance.nfs.strict-write-ordering   off                                     
performance.lazy-open                   yes                                     
performance.read-after-open             no                                      
performance.read-ahead-page-count       4                                       
performance.md-cache-timeout            1                                       
performance.cache-swift-metadata        true                                    
features.encryption                     off                                     
encryption.master-key                   (null)                                  
encryption.data-key-size                256                                     
encryption.block-size                   4096                                    
network.frame-timeout                   1800                                    
network.ping-timeout                    10                                      
network.tcp-window-size                 (null)                                  
features.lock-heal                      off                                     
features.grace-timeout                  10                                      
network.remote-dio                      enable                                  
client.event-threads                    2                                       
network.ping-timeout                    10                                      
network.tcp-window-size                 (null)                                  
network.inode-lru-limit                 16384                                   
auth.allow                              *                                       
auth.reject                             (null)                                  
transport.keepalive                     (null)                                  
server.allow-insecure                   on                                      
server.root-squash                      off                                     
server.anonuid                          65534                                   
server.anongid                          65534                                   
server.statedump-path                   /var/run/gluster                        
server.outstanding-rpc-limit            64                                      
features.lock-heal                      off                                     
features.grace-timeout                  10                                      
server.ssl                              (null)                                  
auth.ssl-allow                          *                                       
server.manage-gids                      off                                     
server.dynamic-auth                     on                                      
client.send-gids                        on                                      
server.gid-timeout                      300                                     
server.own-thread                       (null)                                  
server.event-threads                    2                                       
ssl.own-cert                            (null)                                  
ssl.private-key                         (null)                                  
ssl.ca-list                             (null)                                  
ssl.crl-path                            (null)                                  
ssl.certificate-depth                   (null)                                  
ssl.cipher-list                         (null)                                  
ssl.dh-param                            (null)                                  
ssl.ec-curve                            (null)                                  
performance.write-behind                on                                      
performance.read-ahead                  off                                     
performance.readdir-ahead               on                                      
performance.io-cache                    off                                     
performance.quick-read                  off                                     
performance.open-behind                 on                                      
performance.stat-prefetch               off                                     
performance.client-io-threads           off                                     
performance.nfs.write-behind            on                                      
performance.nfs.read-ahead              off                                     
performance.nfs.io-cache                off                                     
performance.nfs.quick-read              off                                     
performance.nfs.stat-prefetch           off                                     
performance.nfs.io-threads              off                                     
performance.force-readdirp              true                                    
features.uss                            off                                     
features.snapshot-directory             .snaps                                  
features.show-snapshot-directory        off                                     
network.compression                     off                                     
network.compression.window-size         -15                                     
network.compression.mem-level           8                                       
network.compression.min-size            0                                       
network.compression.compression-level   -1                                      
network.compression.debug               false                                   
features.limit-usage                    (null)                                  
features.quota-timeout                  0                                       
features.default-soft-limit             80%                                     
features.soft-timeout                   60                                      
features.hard-timeout                   5                                       
features.alert-time                     86400                                   
features.quota-deem-statfs              off                                     
geo-replication.indexing                off                                     
geo-replication.indexing                off                                     
geo-replication.ignore-pid-check        off                                     
geo-replication.ignore-pid-check        off                                     
features.quota                          off                                     
features.inode-quota                    off                                     
features.bitrot                         disable                                 
debug.trace                             off                                     
debug.log-history                       no                                      
debug.log-file                          no                                      
debug.exclude-ops                       (null)                                  
debug.include-ops                       (null)                                  
debug.error-gen                         off                                     
debug.error-failure                     (null)                                  
debug.error-number                      (null)                                  
debug.random-failure                    off                                     
debug.error-fops                        (null)                                  
nfs.enable-ino32                        no                                      
nfs.mem-factor                          15                                      
nfs.export-dirs                         on                                      
nfs.export-volumes                      on                                      
nfs.addr-namelookup                     off                                     
nfs.dynamic-volumes                     off                                     
nfs.register-with-portmap               on                                      
nfs.outstanding-rpc-limit               16                                      
nfs.port                                2049                                    
nfs.rpc-auth-unix                       on                                      
nfs.rpc-auth-null                       on                                      
nfs.rpc-auth-allow                      all                                     
nfs.rpc-auth-reject                     none                                    
nfs.ports-insecure                      off                                     
nfs.trusted-sync                        off                                     
nfs.trusted-write                       off                                     
nfs.volume-access                       read-write                              
nfs.export-dir                                                                  
nfs.disable                             true                                    
nfs.nlm                                 on                                      
nfs.acl                                 on                                      
nfs.mount-udp                           off                                     
nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab             
nfs.rpc-statd                           /sbin/rpc.statd                         
nfs.server-aux-gids                     off                                     
nfs.drc                                 off                                     
nfs.drc-size                            0x20000                                 
nfs.read-size                           (1 * 1048576ULL)                        
nfs.write-size                          (1 * 1048576ULL)                        
nfs.readdir-size                        (1 * 1048576ULL)                        
nfs.rdirplus                            on                                      
nfs.exports-auth-enable                 (null)                                  
nfs.auth-refresh-interval-sec           (null)                                  
nfs.auth-cache-ttl-sec                  (null)                                  
features.read-only                      off                                     
features.worm                           off                                     
features.worm-file-level                off                                     
features.default-retention-period       120                                     
features.retention-mode                 relax                                   
features.auto-commit-period             180                                     
storage.linux-aio                       off                                     
storage.batch-fsync-mode                reverse-fsync                           
storage.batch-fsync-delay-usec          0                                       
storage.owner-uid                       36                                      
storage.owner-gid                       36                                      
storage.node-uuid-pathinfo              off                                     
storage.health-check-interval           30                                      
storage.build-pgfid                     off                                     
storage.bd-aio                          off                                     
cluster.server-quorum-type              server                                  
cluster.server-quorum-ratio             0                                       
changelog.changelog                     off                                     
changelog.changelog-dir                 (null)                                  
changelog.encoding                      ascii                                   
changelog.rollover-time                 15                                      
changelog.fsync-interval                5                                       
changelog.changelog-barrier-timeout     120                                     
changelog.capture-del-path              off                                     
features.barrier                        disable                                 
features.barrier-timeout                120                                     
features.trash                          off                                     
features.trash-dir                      .trashcan                               
features.trash-eliminate-path           (null)                                  
features.trash-max-filesize             5MB                                     
features.trash-internal-op              off                                     
cluster.enable-shared-storage           disable                                 
cluster.write-freq-threshold            0                                       
cluster.read-freq-threshold             0                                       
cluster.tier-pause                      off                                     
cluster.tier-promote-frequency          120                                     
cluster.tier-demote-frequency           3600                                    
cluster.watermark-hi                    90                                      
cluster.watermark-low                   75                                      
cluster.tier-mode                       cache                                   
cluster.tier-max-promote-file-size      0                                       
cluster.tier-max-mb                     4000                                    
cluster.tier-max-files                  10000                                   
features.ctr-enabled                    off                                     
features.record-counters                off                                     
features.ctr-record-metadata-heat       off                                     
features.ctr_link_consistency           off                                     
features.ctr_lookupheal_link_timeout    300                                     
features.ctr_lookupheal_inode_timeout   300                                     
features.ctr-sql-db-cachesize           1000                                    
features.ctr-sql-db-wal-autocheckpoint  1000                                    
locks.trace                             off                                     
locks.mandatory-locking                 off                                     
cluster.disperse-self-heal-daemon       enable                                  
cluster.quorum-reads                    no                                      
client.bind-insecure                    (null)                                  
ganesha.enable                          off                                     
features.shard                          on                                      
features.shard-block-size               4MB                                     
features.scrub-throttle                 lazy                                    
features.scrub-freq                     biweekly                                
features.scrub                          false                                   
features.expiry-time                    120                                     
features.cache-invalidation             off                                     
features.cache-invalidation-timeout     60                                      
features.leases                         off                                     
features.lease-lock-recall-timeout      60                                      
disperse.background-heals               8                                       
disperse.heal-wait-qlength              128                                     
cluster.heal-timeout                    600                                     
dht.force-readdirp                      on                                      
disperse.read-policy                    round-robin                             
cluster.shd-max-threads                 8                                       
cluster.shd-wait-qlength                10000                                   
cluster.locking-scheme                  granular                                
cluster.granular-entry-heal             no
Comment 13 Ala Hino 2017-06-21 06:45:14 EDT
I see that you created replica 5 volumes.

For the sake of debug, can you try the following?

1. create a replica 1 volume (on gluster server)
2. create a gluster storage domain (sd) with the replica 1 volume created before
3. create a new VM with disk on that gluster sd (no need to install OS nor copy data)
4. create 4 snapshots and delete them as done before

Thanks!
Comment 14 Ala Hino 2017-06-21 07:05:33 EDT
Can you also please send the gluster logs under /var/log/glusterfs? There is a cli.log file and the per-volume logs.
Comment 15 richardfalzini 2017-06-21 10:22 EDT
Created attachment 1290121 [details]
gluster log

(In reply to Ala Hino from comment #13)
> I see that you create replica 5 volumes.
> 
> For the sake of debug, can you try the following?
> 
> 1. create a replica 1 volume (on gluster server)
> 2. create a gluster storage domain (sd) with the replica 1 volume created
> before
> 3. create a new VM with disk on that gluster sd (no need to install OS nor
> copy data)
> 4. create 4 snapshots and delete them as done before
> 

The vm-images-repo-demo volume, in which the VM image is located, is configured as a replica 3 volume.
In the attachment you can find the gluster logs from the test done before.
Comment 16 Ala Hino 2017-06-22 05:04:20 EDT
Can you please send the content of the following files:

/rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/c54dd527-06ac-4790-bcb8-8c3881b05442.meta

/rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/dd3485a6-6843-4664-bbbc-b6db9f68eb58.meta

/rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/bd74e5dc-95fd-40e3-aeb0-5fef6b67d172.meta

/rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/81b15267-721d-45ec-b9e2-061663febe5a.meta
Comment 17 richardfalzini 2017-06-22 06:34:46 EDT
(In reply to Ala Hino from comment #16)
> Can you please send the content of the following files:
> 
> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/c54dd527-06ac-
> 4790-bcb8-8c3881b05442.meta
> 

DOMAIN=73ca2906-e59d-4c13-97f4-f636cad9fb0e
CTIME=1497011478
FORMAT=COW
DISKTYPE=2
LEGALITY=LEGAL
SIZE=31457280
VOLTYPE=LEAF
DESCRIPTION=
IMAGE=09e08d20-6317-4418-9b69-9e5f396b64f9
PUUID=dd3485a6-6843-4664-bbbc-b6db9f68eb58
MTIME=0
POOL_UUID=
TYPE=SPARSE
GEN=0
EOF

> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/dd3485a6-6843-
> 4664-bbbc-b6db9f68eb58.meta
> 

DOMAIN=73ca2906-e59d-4c13-97f4-f636cad9fb0e
CTIME=1497011131
FORMAT=COW
DISKTYPE=2
LEGALITY=LEGAL
SIZE=31457280
VOLTYPE=INTERNAL
DESCRIPTION=
IMAGE=09e08d20-6317-4418-9b69-9e5f396b64f9
PUUID=94f463fa-f431-4a6a-b1f5-11b234161de3
MTIME=0
POOL_UUID=
TYPE=SPARSE
GEN=0
EOF

> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/bd74e5dc-95fd-
> 40e3-aeb0-5fef6b67d172.meta
> 


DOMAIN=73ca2906-e59d-4c13-97f4-f636cad9fb0e
CTIME=1497007610
FORMAT=COW
DISKTYPE=2
LEGALITY=LEGAL
SIZE=31457280
VOLTYPE=LEAF
DESCRIPTION=
IMAGE=09e08d20-6317-4418-9b69-9e5f396b64f9
PUUID=81b15267-721d-45ec-b9e2-061663febe5a
MTIME=0
POOL_UUID=
TYPE=SPARSE
GEN=0
EOF


> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/81b15267-721d-
> 45ec-b9e2-061663febe5a.meta


DOMAIN=73ca2906-e59d-4c13-97f4-f636cad9fb0e
CTIME=1497001994
FORMAT=RAW
DISKTYPE=2
LEGALITY=LEGAL
SIZE=31457280
VOLTYPE=INTERNAL
DESCRIPTION={"DiskAlias":"mergeWftest4_Disk1","DiskDescription":"mergeWftest4_Disk1"}
IMAGE=09e08d20-6317-4418-9b69-9e5f396b64f9
PUUID=00000000-0000-0000-0000-000000000000
MTIME=0
POOL_UUID=
TYPE=SPARSE
GEN=0
EOF
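The .meta files above are flat KEY=VALUE records terminated by an EOF line. As a minimal sketch (a hypothetical helper, not vdsm's actual parser), they can be read like this:

```python
def parse_meta(text):
    """Parse a vdsm volume .meta blob: KEY=VALUE lines terminated by EOF."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "EOF":  # terminator line, not a key
            break
        if "=" in line:
            key, _, value = line.partition("=")
            meta[key] = value
    return meta

# Trimmed-down example of one of the .meta files quoted above.
sample = """\
DOMAIN=73ca2906-e59d-4c13-97f4-f636cad9fb0e
FORMAT=COW
VOLTYPE=LEAF
PUUID=dd3485a6-6843-4664-bbbc-b6db9f68eb58
EOF
"""
info = parse_meta(sample)
```

The PUUID key is the interesting one for live merge: it points at the volume's parent in the snapshot chain.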
Comment 18 Ala Hino 2017-06-22 09:17:33 EDT
Thanks.

I need qemu-img info of the original volumes again (the image name was missing from what you sent before) - the output of the following commands:

qemu-img info /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/c54dd527-06ac-4790-bcb8-8c3881b05442

qemu-img info /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/dd3485a6-6843-4664-bbbc-b6db9f68eb58

qemu-img info /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/bd74e5dc-95fd-40e3-aeb0-5fef6b67d172

qemu-img info /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/81b15267-721d-45ec-b9e2-061663febe5a

Please make sure to include the image and the backing file.

Thanks.
Comment 19 richardfalzini 2017-06-22 10:23:14 EDT
(In reply to Ala Hino from comment #18)
> Thanks.
> 
> I need qume-img info again of the original volumes (the image name was
> missing from what you sent before) - the output of the following commands:
> 
> qemu-img info
> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/c54dd527-06ac-
> 4790-bcb8-8c3881b05442
> 
image: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/c54dd527-06ac-4790-bcb8-8c3881b05442
file format: qcow2
virtual size: 15G (16106127360 bytes)
disk size: 6.0G
cluster_size: 65536
backing file: dd3485a6-6843-4664-bbbc-b6db9f68eb58 (actual path: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/dd3485a6-6843-4664-bbbc-b6db9f68eb58)
backing file format: qcow2
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false


> qemu-img info
> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/dd3485a6-6843-
> 4664-bbbc-b6db9f68eb58
>

image: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/dd3485a6-6843-4664-bbbc-b6db9f68eb58
file format: qcow2
virtual size: 15G (16106127360 bytes)
disk size: 163M
cluster_size: 65536
backing file: 81b15267-721d-45ec-b9e2-061663febe5a (actual path: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/81b15267-721d-45ec-b9e2-061663febe5a)
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

 
> qemu-img info
> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/bd74e5dc-95fd-
> 40e3-aeb0-5fef6b67d172
> 

image: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/bd74e5dc-95fd-40e3-aeb0-5fef6b67d172
file format: qcow2
virtual size: 15G (16106127360 bytes)
disk size: 1.3G
cluster_size: 65536
backing file: 81b15267-721d-45ec-b9e2-061663febe5a (actual path: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/81b15267-721d-45ec-b9e2-061663febe5a)
backing file format: raw
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false


> qemu-img info
> /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-
> 97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/81b15267-721d-
> 45ec-b9e2-061663febe5a
>

image: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/09e08d20-6317-4418-9b69-9e5f396b64f9/81b15267-721d-45ec-b9e2-061663febe5a
file format: raw
virtual size: 15G (16106127360 bytes)
disk size: 6.0G
 
> Please make sure to include the image and the backing file.
> 
> Thanks.
 
Thanks.
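The qemu-img outputs above describe a chain where each .meta PUUID points at its parent volume and the raw base has a null parent. A small sketch (hypothetical helper, using the UUIDs from this comment) of reconstructing the chain from leaf to base:

```python
NULL_UUID = "00000000-0000-0000-0000-000000000000"

# Parent pointers taken from the .meta / qemu-img output in this thread:
# c54dd527 (qcow2 leaf) -> dd3485a6 (qcow2) -> 81b15267 (raw base).
parents = {
    "c54dd527-06ac-4790-bcb8-8c3881b05442": "dd3485a6-6843-4664-bbbc-b6db9f68eb58",
    "dd3485a6-6843-4664-bbbc-b6db9f68eb58": "81b15267-721d-45ec-b9e2-061663febe5a",
    "81b15267-721d-45ec-b9e2-061663febe5a": NULL_UUID,
}

def chain_from_leaf(leaf, parents):
    """Walk PUUID links from a leaf volume down to the raw base."""
    chain = []
    vol = leaf
    while vol != NULL_UUID:
        chain.append(vol)
        vol = parents[vol]
    return chain

chain = chain_from_leaf("c54dd527-06ac-4790-bcb8-8c3881b05442", parents)
```

A live merge of a snapshot effectively commits one of these qcow2 volumes into its parent and re-links the chain around it.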
Comment 20 Allon Mureinik 2017-07-02 16:37:47 EDT
4.1.4 is planned as a minimal, fast, z-stream version to fix any open issues we may have in supporting the upcoming EL 7.4.

Pushing out anything unrelated, although if there's a minimal/trivial, SAFE fix that's ready on time, we can consider introducing it in 4.1.4.
Comment 21 richardfalzini 2017-07-06 12:17 EDT
Created attachment 1294988 [details]
test with oVirt 4.1.3

I updated to oVirt 4.1.3 and tried a new test.
Unfortunately the results are the same. I collected all the useful information in the tar.
Please tell me if there is anything I can do to help find and solve the problem.
Thank you so much.



vmName=	mergeWftest9
vmId= 02c854d2-fade-44a3-a215-1d81bc68108a
disk_groupe_id= bfdbd9cb-dcc6-4278-84bb-7a08be7fbfcf
mountpointpath: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/bfdbd9cb-dcc6-4278-84bb-7a08be7fbfcf


snapshot_disk_id:
 current = 5abffd41-d079-47de-8dc6-b2aba94f801b
 s4 = 4c6c9cb7-a280-4ce9-96a4-2f7f4fb52137
 s3 = 6478a13f-5e83-45ab-ad84-23e90b2b323c
 s2 = 966ab0f5-887b-41f8-bd85-7cb9bce8fdfb
 s1 = 49bb049c-db4c-42dc-822d-fc89ed075e9a

check the chain qemu-img info:
file qemu-info-chain1.txt
file vdsm-Volume-Info1.txt


now start delete s1:

Jul 6, 2017 5:40:39 PM Snapshot 's1' deletion for VM 'mergeWftest9' was initiated by admin@internal-authz.

Jul 6, 2017 5:42:48 PM Snapshot 's1' deletion for VM 'mergeWftest9' has been completed.



correlationId: bff808e8-3866-4d52-a061-995bcdc2fcea	
	


new situation:

snapshot_disk_id:
 current = 5abffd41-d079-47de-8dc6-b2aba94f801b
 s4 = 4c6c9cb7-a280-4ce9-96a4-2f7f4fb52137
 s3 = 6478a13f-5e83-45ab-ad84-23e90b2b323c
 s2 = 49bb049c-db4c-42dc-822d-fc89ed075e9a



check the chain qemu-img info:
file qemu-info-chain2.txt

vdsm-client Volume getInfo:
file volumeInfo2.txt 


update ovs_storage

now start delete s2:

Jul 6, 2017 5:48:16 PM Snapshot 's2' deletion for VM 'mergeWftest9' was initiated by admin@internal-authz.

correlationId: 09edb0bd-ba1a-4fb8-a638-c141467278ce
Comment 22 Ala Hino 2017-07-07 06:43:52 EDT
Thank you very much for providing this valuable information.
I will look into the new logs and see what I can learn from them.
Will update ASAP.
Comment 23 Ala Hino 2017-07-16 10:45:52 EDT
I see that during the block commit job, libvirt fires VIR_EVENT_HANDLE_HANGUP, and it seems the job hadn't completed yet. I will have to investigate further why that event fired, what the consequences are, and how to recover.

How many hosts do you have in your environment? Can you upload the SPM logs?

Thanks.
Comment 24 richardfalzini 2017-07-17 08:06 EDT
Created attachment 1299824 [details]
test within SPM

Hi,
I have 5 hosts in my configuration.
About the SPM log: I could not find the vdsm logs of the old test because they had already been overwritten by log rotation.
So I ran a new test, this time on the host holding the SPM role.

I hope the tar contains all the logs you need; if you need more, I'll be happy to provide them.

Thank you very much for helping.
Comment 25 Ala Hino 2017-07-19 11:38:27 EDT
Can you please send qemu info of each volume in the chain?
Comment 26 richardfalzini 2017-07-19 12:01 EDT
Created attachment 1301237 [details]
test13_InsideSPM

By mistake I uploaded the old tar; sorry about that. Here are the files of the new test.


vmName=	mergeWftest13
vmId= 4a2be9b3-e24d-45f6-ab56-123b5c60e253
disk_groupe_id= 11a424e1-65dc-4769-981c-10d312d26e86
mountpointpath: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/11a424e1-65dc-4769-981c-10d312d26e86


snapshot_disk_id:
 current = 968e63b2-7a92-4045-82b6-69f96b6d66e8
 s4 = cb49d1ca-3d90-4ccb-805d-f8b60b596358
 s3 = dbde119b-5686-4ab9-8d8c-7ea4aaec52f8
 s2 = e49fbac2-9be8-44b4-84ff-2edb029e357c
 s1 = ebb51580-08fe-4c68-b03a-abc4a1fed413

check the chain qemu-img info:
	file qemu-info-chain1.txt
	file vdsm-Volume-Info1.txt


now start delete s1:

Jul 17, 2017 12:01:52 PM Snapshot 's1' deletion for VM 'mergeWftest13' was initiated by admin@internal-authz.

Jul 17, 2017 12:02:52 PM Snapshot 's1' deletion for VM 'mergeWftest13' has been completed.


correlationId: 	e8b27d6a-ebe6-456b-be54-39994c04eb66
	


new situation:

snapshot_disk_id:
 current = 968e63b2-7a92-4045-82b6-69f96b6d66e8
 s4 = cb49d1ca-3d90-4ccb-805d-f8b60b596358
 s3 = dbde119b-5686-4ab9-8d8c-7ea4aaec52f8
 s2 = ebb51580-08fe-4c68-b03a-abc4a1fed413



check the chain qemu-img info:
file qemu-info-chain2.txt

vdsm-client Volume getInfo:
file volumeInfo2.txt 


update ovs_storage

now start delete s2:

 Jul 17, 2017 12:09:13 PM Snapshot 's2' deletion for VM 'mergeWftest13' was initiated by admin@internal-authz.

	
correlationId: b1da1e27-17c5-4601-8bfb-d7d98561caa4



check the chain qemu-img info:
file qemu-info-chain3.txt

vdsm-client Volume getInfo:
file volumeInfo3.txt
Comment 27 Ala Hino 2017-07-19 14:33:49 EDT
Can you please send the HSM (the host running the VM) logs as well?
Comment 28 Ala Hino 2017-07-19 16:12:37 EDT
I've created a patch where I added more debug logs to help us better investigate the issue you encounter.

Can you apply the changes in your setup and try the flow again?

The patch is here: https://gerrit.ovirt.org/79622

In your setup, on the hosts (I need the SPM and the host running the VM or all the hosts):

1. Open /usr/share/vdsm/storage/fileVolume.py
2. Add the highlighted lines from the patch at exactly the same line numbers
3. Restart the host
Comment 29 richardfalzini 2017-07-20 09:57 EDT
Created attachment 1301749 [details]
wf with patch 79622

Applied the patch, rebooted the node, and executed the workflow.

Can you please tell me where I can find the HSM log?

vmName=	testWfmerge14
vmId= a748fb84-2dc5-4ab5-89fb-cc25f2e8a25f
disk_groupe_id= e3d05664-42b3-4c8a-87c9-1dee5403dedd
mountpointpath: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/e3d05664-42b3-4c8a-87c9-1dee5403dedd

snapshot_disk_id:
 current = 15f3790f-4763-40e1-9176-e688630df07b
 s4 = 2c29f69c-a552-4aed-ac5c-c23f5cb04e5e
 s3 = 64cd8559-9d75-41e9-9ae2-0c91ef7c8048
 s2 = 440676d3-fcb9-424d-bf54-77348b34e275
 s1 = 59286a0b-683c-4fa2-b167-59bf8c665141

check the chain qemu-img info:
	file qemu-info-chain1.txt
	file vdsm-Volume-Info1.txt


now start delete s1:
Jul 20, 2017 2:51:42 PM Snapshot 's1' deletion for VM 'testWfmerge14' was initiated by admin@internal-authz.
Jul 20, 2017 2:52:41 PM Snapshot 's1' deletion for VM 'testWfmerge14' has been completed.

correlationId: 	2c4abb74-da53-49c5-b28a-31141cb09937
	


new situation:

snapshot_disk_id:
 current = 15f3790f-4763-40e1-9176-e688630df07b
 s4 = 2c29f69c-a552-4aed-ac5c-c23f5cb04e5e
 s3 = 64cd8559-9d75-41e9-9ae2-0c91ef7c8048
 s2 = 59286a0b-683c-4fa2-b167-59bf8c665141




check the chain qemu-img info:
file qemu-info-chain2.txt

vdsm-client Volume getInfo:
file vdsm-Volume-Info2.txt


now start delete s2:

 Jul 20, 2017 3:02:44 PM Snapshot 's2' deletion for VM 'testWfmerge14' was initiated by admin@internal-authz.

	
correlationId: 6c84cce0-593f-45a0-8b14-e2cf39df4aa7 


check the chain qemu-img info:
file qemu-info-chain3.txt

vdsm-client Volume getInfo:
file volumeInfo3.txt
Comment 30 Ala Hino 2017-07-20 09:59:16 EDT
HSM is actually the host running the VM. The log file is still the vdsm log file on that machine.
Comment 31 richardfalzini 2017-07-20 11:01:53 EDT
So in this case (the same host running the VM and the SPM) you need just the vdsm.log,
am I right?
The logs are in the tar attached to Comment 29:
 -vdsm.log
 -vdsm.log.1.xz


Thank you so much for helping me.
Comment 32 Ala Hino 2017-07-20 11:38:26 EDT
My bad, I wasn't looking for the correct message.

I am seeing something unexpected and have added more log messages that will help me understand the root cause of the issue.

I will have to ask you to apply the new messages (in addition to the existing ones) and try again. Can you?

Please make sure you look at the updated version, here:
https://gerrit.ovirt.org/#/c/79622/2/vdsm/storage/fileVolume.py

Thanks!
Comment 33 Ala Hino 2017-07-20 12:56:50 EDT
Can you please run the following command on the host where the VM is running?

rpm -qa
Comment 34 richardfalzini 2017-07-20 13:03 EDT
Created attachment 1301894 [details]
test 15 patch 2

Applied the new patch
Rebooted the host
Executed the workflow

vmName=	mergeWftest15
vmId= 
disk_groupe_id= 73aef966-5dfe-4a8c-8bad-9c71a423a9ed
mountpointpath: /rhev/data-center/00000001-0001-0001-0001-0000000001a9/73ca2906-e59d-4c13-97f4-f636cad9fb0e/images/73aef966-5dfe-4a8c-8bad-9c71a423a9ed

snapshot_disk_id:
 current = d5e0eabd-3973-4502-bb1c-0eb6584fc50f
 s4 = f5e891bf-4251-481d-8aee-3ae87626da26
 s3 = 6473996c-a623-41b5-9388-6ffd9f3a471d
 s2 = ddee50e3-cd88-44e4-9f40-ffb0962f1cd7
 s1 = 675d0eef-644c-44d7-b196-17459f6bcc56

check the chain qemu-img info:
	file qemu-info-chain1.txt
	file vdsm-Volume-Info1.txt


now start delete s1:
Jul 20, 2017 6:34:23 PM Snapshot 's1' deletion for VM 'mergeWftest15' was initiated by admin@internal-authz.

Jul 20, 2017 6:35:15 PM Snapshot 's1' deletion for VM 'mergeWftest15' has been completed.

correlationId:  a9de6c4d-7465-4fe8-879c-c273be8f113e	
	
new situation:

 current = d5e0eabd-3973-4502-bb1c-0eb6584fc50f
 s4 = f5e891bf-4251-481d-8aee-3ae87626da26
 s3 = 6473996c-a623-41b5-9388-6ffd9f3a471d
 s2 = 675d0eef-644c-44d7-b196-17459f6bcc56



check the chain qemu-img info:
file qemu-info-chain2.txt

vdsm-client Volume getInfo:
file vdsm-Volume-Info2.txt


now start delete s2:

Jul 20, 2017 6:42:04 PM Snapshot 's2' deletion for VM 'mergeWftest15' was initiated by admin@internal-authz.

	
correlationId: 675d0eef-644c-44d7-b196-17459f6bcc56


check the chain qemu-img info:
file qemu-info-chain3.txt

vdsm-client Volume getInfo:
file volumeInfo3.txt
Comment 35 richardfalzini 2017-07-20 13:06 EDT
Created attachment 1301895 [details]
rpm -qa

as request
rpm -qa
Comment 36 Ala Hino 2017-07-20 17:00:58 EDT
Thanks a lot for providing the info. This is much appreciated.

This failure happens because the volume name (vm-images-repo-demo) contains the string "images".

I'd like to kindly ask you to enable DEBUG logs for the following components in /etc/vdsm/logger.conf:

logger_root
logger_vds
logger_storage (probably already in DEBUG level)
logger_IOProcess

Then try again, this time with a volume whose name doesn't contain "images".

Once again, thanks a lot for your cooperation on this!
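The DEBUG levels requested above map onto /etc/vdsm/logger.conf entries roughly like this (a sketch only; the section names follow vdsm's stock logger.conf, but the handlers lines are assumptions and your file may differ):

```ini
[logger_root]
level=DEBUG
handlers=syslog,logfile

[logger_vds]
level=DEBUG
handlers=syslog,logfile
qualname=vds

[logger_storage]
level=DEBUG
handlers=logfile
qualname=storage

[logger_IOProcess]
level=DEBUG
handlers=logfile
qualname=IOProcess
```

Restart vdsmd after editing so the new levels take effect.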
Comment 37 Ala Hino 2017-07-20 18:53:04 EDT
I've applied a fix to handle volume names that include "images".

The fix is here: https://gerrit.ovirt.org/#/c/79657/1

Please note that the fix was applied on the master branch.
If you'd like to apply the fix locally, replace line 169 in /usr/share/vdsm/storage/fileVolume.py with:

    domPath = self.imagePath.rsplit('images', 1)[0]

Hope this fixes the issue you see.
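To illustrate the one-line fix above in isolation: splitting an image path on the first occurrence of "images" truncates inside a volume name like "vm-images-repo-demo", while rsplit('images', 1) splits on the last occurrence, i.e. the images/ directory of the storage domain. A minimal sketch (the path, volume name, and UUIDs below are made up for demonstration):

```python
# Hypothetical image path on a file-based (Gluster) storage domain whose
# volume name -- "vm-images-repo-demo" -- itself contains "images".
image_path = ("/rhev/data-center/mnt/glusterSD/srv:_vm-images-repo-demo/"
              "a1b2c3d4-0000-0000-0000-000000000000/images/"
              "7e9f0000-1111-2222-3333-444444444444")

# Splitting on the FIRST occurrence of "images" cuts inside the volume name:
broken = image_path.split('images', 1)[0]
print(broken)    # /rhev/data-center/mnt/glusterSD/srv:_vm-

# rsplit('images', 1) splits on the LAST occurrence, which is the images/
# directory that every file-based storage domain contains, so the full
# domain path survives:
dom_path = image_path.rsplit('images', 1)[0]
print(dom_path)  # ends with .../a1b2c3d4-0000-0000-0000-000000000000/
```

This is why the bug only shows up when the mount point or domain path happens to contain the substring "images".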
Comment 38 richardfalzini 2017-07-21 07:46:50 EDT
Thank you so much for the help, the patch works well.
Do you need the vdsm.log?
Do you think it will be available in 4.1.5?
Comment 39 Ala Hino 2017-07-21 07:52:00 EDT
Good news.
No need for vdsm.log.
Yes, the fix will be in 4.1.5.

Thank you for your cooperation on this bug.
Comment 40 Ala Hino 2017-07-21 13:47:51 EDT
This issue is about "images" appearing in the path of file-based storage domains - Gluster or NFS.

To reproduce/verify:

1. Create a sd that contains "images" in its name
2. Create 3 snapshots - s1, s2 and s3
3. Delete s1 - this works
4. Delete s2 - this fails without the fix
Comment 41 Allon Mureinik 2017-07-23 05:42:09 EDT
(In reply to Ala Hino from comment #40)
> To reproduce/verify:
> 
> 1. Create a sd that contains "images" in its name
Don't you mean in its **path**?
Comment 42 Ala Hino 2017-07-23 05:47:27 EDT
Indeed. Fixed. Thanks
Comment 43 Allon Mureinik 2017-07-29 22:14:38 EDT
Eyal - this BZ was automatically moved to ON_QA, but no target release was set. Was that intentional?
Comment 44 Eyal Edri 2017-08-01 04:58:06 EDT
Hi Allon,
We never automated or agreed on the logic to add target release to bugs, and it's currently done manually by the bug owner/project maintainer.

There are several issues with automating it, and there wasn't an agreement on the process when it was discussed.
Comment 45 Lilach Zitnitski 2017-08-03 08:52:38 EDT
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.1.5-0.1.el7.noarch
vdsm-4.19.25-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Create a VM with 4 snapshots
2. Delete one snapshot
3. Delete one more
4. Check whether the engine reports that the merge has failed.

Actual results:
The merge completes successfully and the snapshot is removed.

Expected results:

Moving to VERIFIED!
Comment 46 Allon Mureinik 2017-08-03 10:25:33 EDT
Ala, can you please add some doctext to this BZ?
