Bug 1013338

Summary: Error when removing snapshot - Storage domain layout corrupted if domain is mounted by two different paths (e.g. FQDN and IP)
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.3.0
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Raz Tamir <ratamir>
Assignee: Allon Mureinik <amureini>
QA Contact: Elad <ebenahar>
CC: acanan, acathrow, amureini, bazulay, ebenahar, iheim, lnatapov, lpeer, nsoffer, ratamir, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Bug Blocks: 1078909, 1142926
Attachments:
- engine + vdsm logs (flags: none)
- all_vdsm_logs (flags: none)

Description Raz Tamir 2013-09-29 11:37:22 UTC
Created attachment 804671 [details]
engine + vdsm logs

Description of problem:
When trying to delete a snapshot that was created while the VM was running (live snapshot), an error message appears:
"Error while executing action Remove Snapshot: Storage domain layout corrupted".

From the vdsm log, the problem occurs during the "deleteImage" flow:

Thread-321928::DEBUG::2013-09-29 11:41:12,621::task::974::TaskManager.Task::(_decref) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::ref 1 aborting False
Thread-321928::ERROR::2013-09-29 11:41:12,634::task::850::TaskManager.Task::(_setError) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1543, in deleteImage
    dom.deleteImage(sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 343, in deleteImage
    currImgDir = getImagePath(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 98, in getImagePath
    return os.path.join(getDomPath(sdUUID), 'images', imgUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 92, in getDomPath
    raise se.StorageDomainLayoutError(sdUUID)
StorageDomainLayoutError: Storage domain layout corrupted: ('d2d701ea-a185-483c-af8e-e71753dbb0cc',)
Thread-321928::DEBUG::2013-09-29 11:41:12,635::task::869::TaskManager.Task::(_run) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::Task._run: 73c7cb7d-425b-4245-b283-5d8efe3f290f ('d2d701ea-a185-483c-af8e-e71753dbb0cc', '0fc8637c-c5a9-463d-81d7-42266087900a', '6f151c8d-740c-459c-8c6d-61fb0b2273ed', 'false', 'false') {} failed - stopping task

The NFS domain is somehow mounted twice on the host - once through the FQDN and once through the IP:

[root@gold-vdsc ~]# ll /rhev/data-center/mnt/
total 16
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 10.35.64.106:_fastpass_ratamir_nfs1
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 10.35.64.106:_fastpass_ratamir_nfs2
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2


So it seems vdsm sees the same SD UUID (SD=d2d701ea-a185-483c-af8e-e71753dbb0cc) directory under 2 different paths:
[root@gold-vdsc ~]# tree /rhev/data-center/mnt/ | grep d2d
|   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc
|   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc


Version-Release number of selected component (if applicable):
vdsm-4.12.0-138.gitab256be.el6ev.x86_64
rhevm-3.3.0-0.22.master.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
Setup with 1 NFS DC, 1 host, 2 storage domains.

1. Create a new VM
2. Create a live snapshot of that VM
3. When the snapshot status is "OK", "Power Off" the VM
4. Delete the snapshot

Actual results:
An error message appears:
"Error while executing action Remove Snapshot: Storage domain layout corrupted".

and the snapshot status becomes "BROKEN" in the webadmin portal.

Expected results:
The snapshot should be removed

Additional info:

Comment 1 Nir Soffer 2013-09-29 12:18:44 UTC
(In reply to ratamir from comment #0)
> Created attachment 804671 [details]
> engine + vdsm logs
> 
> Description of problem:
> When trying to delete a snapshot that was created when the vm running (live
> snapshot) ,

Can we see the creation of the snapshot in the logs? 

> 
> the nfs domain is somehow mounted twice on the host - once through fqdn once
> through ip:
> 
> [root@gold-vdsc ~]# ll /rhev/data-center/mnt/
> total 16
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 10.35.64.106:_fastpass_ratamir_nfs1
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 10.35.64.106:_fastpass_ratamir_nfs2
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03
> tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20
> tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2
> 
> 
> so it seems vdsm sees the same
> SDUUID(SD=d2d701ea-a185-483c-af8e-e71753dbb0cc) directory under 2 different
> paths:
> root@gold-vdsc ~]# tree /rhev/data-center/mnt/ | grep d2d
> |   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc
> |   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc

Can you check how the NFS domain is mounted before and after the creation of the snapshot?

> Expected results:
> The snapshot should be removed

Can you tell on which version this operation did succeed?

Comment 2 Ayal Baron 2013-09-29 13:09:39 UTC
Please attach all vdsm logs (the attached 2 logs do not contain all the info necessary).

Please attach /proc/mounts as well.

The problem is, as you stated, that the domain is mounted twice; we need to figure out how you reached this state.

Comment 3 Raz Tamir 2013-09-29 13:49:03 UTC
Created attachment 804719 [details]
all_vdsm_logs

(In reply to Ayal Baron from comment #2)
> Please attach all vdsm logs (the attached 2 logs do not contain all the info
> necessary).
All vdsm logs added.

> 
> Please attach /proc/mounts as well.
[root@gold-vdsc ~]# mount
/dev/mapper/vg0-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)               
sysfs on /sys type sysfs (rw)              
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)                                           
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)                      
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)                      
tiger.qa.lab.tlv.redhat.com:/fastpass/ratamir/nfs1 on /rhev/data-center/mnt/tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)                                                                                                                                                                            
tiger.qa.lab.tlv.redhat.com:/fastpass/ratamir/nfs2 on /rhev/data-center/mnt/tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)                                                                                                                                                                            
10.35.64.106:/fastpass/ratamir/nfs1 on /rhev/data-center/mnt/10.35.64.106:_fastpass_ratamir_nfs1 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)             
10.35.64.106:/fastpass/ratamir/nfs2 on /rhev/data-center/mnt/10.35.64.106:_fastpass_ratamir_nfs2 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)  
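As a side note, a quick ad-hoc check like the one below (not part of vdsm; the path and the UUID-matching pattern are just illustrative) can flag this state on a host by reporting any storage domain UUID that is visible under more than one mount point:

import collections
import glob
import os

MNT_ROOT = "/rhev/data-center/mnt"

# Map each storage-domain-looking directory name to every mount it appears under.
seen = collections.defaultdict(list)
for path in glob.glob(os.path.join(MNT_ROOT, "*", "*-*-*-*-*")):
    seen[os.path.basename(path)].append(os.path.dirname(path))

for sd_uuid, mounts in seen.items():
    if len(mounts) > 1:
        print("duplicate mounts for %s:" % sd_uuid)
        for mount in mounts:
            print("  %s" % mount)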

> 
> The problem is as you stated that the domain is mounted twice, we need to
> figure out how you reached this state.

Comment 4 Ayal Baron 2013-09-29 14:26:23 UTC
Unfortunately, these logs still do not contain the info about when these domains were mounted.

Comment 6 Ayal Baron 2013-09-29 19:08:57 UTC
Haim, why is this a regression?
Can you reproduce?

Comment 7 Allon Mureinik 2013-09-30 08:29:55 UTC
From the engine log:

2013-09-24 17:38:27,685 - Attempt to create the domain via IP mount (failed):
2013-09-24 17:38:27,685 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-10) Command CreateStorageDomainVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to CreateStorageDomainVDS, error = Storage domain is not empty - requires cleaning: ('10.35.64.106:/fastpass/ratamir/nfs1',)
2013-09-24 17:38:27,685 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-10) FINISH, CreateStorageDomainVDSCommand, log id: 9e368ee
2013-09-24 17:38:27,686 ERROR [org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand] (ajp-/127.0.0.1:8702-10) Command org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to CreateStorageDomainVDS, error = Storage domain is not empty - requires cleaning: ('10.35.64.106:/fastpass/ratamir/nfs1',) (Failed with error StorageDomainNotEmpty and code 361)

There are several instances of creating a storage domain (either successful or failing) in the log, both from IP and from FQDN.
When the creation fails, there is no DisconnectStorageServerCommand upon failure, resulting in the same path being mounted twice.

Action items: need to try to reproduce this scenario, and see:
1. Does this happen consistently? (QA)
2. Check if the command is reported with success=true (even though it failed), explaining why no DisconnectStorageServer was issued (dev).
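A sketch of the failure path described above, with hypothetical, self-contained stand-ins for the engine's connect/create/disconnect commands (illustrative only, not the engine's API):

def connect_storage_server(conn):
    # Placeholder for ConnectStorageServerVDSCommand: mounts the NFS export on the host.
    print("mount %s" % conn)

def disconnect_storage_server(conn):
    # Placeholder for DisconnectStorageServerVDSCommand: unmounts the export again.
    print("umount %s" % conn)

def create_storage_domain(conn):
    # Placeholder for CreateStorageDomainVDSCommand: can fail, e.g. StorageDomainNotEmpty.
    raise RuntimeError("Storage domain is not empty - requires cleaning")

def add_nfs_storage_domain(conn):
    connect_storage_server(conn)
    try:
        create_storage_domain(conn)
    except Exception:
        # The missing step: without disconnecting on failure, the mount from the
        # failed attempt stays behind, and a later attempt via the other address
        # (IP vs. FQDN) leaves the same domain mounted twice.
        disconnect_storage_server(conn)
        raise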

Comment 8 Allon Mureinik 2013-09-30 08:32:18 UTC
In any event, resetting the component, since VDSM can't run a command that's never sent to it :-)

Comment 9 Raz Tamir 2013-09-30 10:41:52 UTC
(In reply to Allon Mureinik from comment #7)
> From the engine log:
> 
> 2013-09-24 17:38:27,685 - Attempt to create the domain via IP mount (failed):
> 2013-09-24 17:38:27,685 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand]
> (ajp-/127.0.0.1:8702-10) Command CreateStorageDomainVDS execution failed.
> Exception: VDSErrorException: VDSGenericException: VDSError
> Exception: Failed in vdscommand to CreateStorageDomainVDS, error = Storage
> domain is not empty - requires cleaning:
> ('10.35.64.106:/fastpass/ratamir/nfs1',)
> 2013-09-24 17:38:27,685 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand]
> (ajp-/127.0.0.1:8702-10) FINISH, CreateStorageDomainVDSCommand, log id:
> 9e368ee
> 2013-09-24 17:38:27,686 ERROR
> [org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand]
> (ajp-/127.0.0.1:8702-10) Command
> org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand throw Vdc Bll
> exception. With error message VdcBLL
> Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException:
> VDSGenericException: VDSErrorException: Failed in vdscommand to
> CreateStorageDomainVDS, error = Storage domain is not empty - requires
> cleaning: ('10.35.64.106:/fast
> pass/ratamir/nfs1',) (Failed with error StorageDomainNotEmpty and code 361)
> 
> There are several instances of creating a storage domain (either successful
> or failing) in the log, both from IP and from FQDN.
> When the creation fails, there is no DisconnectStorageServerCommand upon
> failure, resulting in the same path being mounted twice.
> 
> Action items: need to try to reproduce this scenario, and see:
> 1. Does this happen consistently (QA)

- From the GUI, the mount is cleaned up.
- From REST, it is not cleaned up after a failed creation.

> 2. Check if the command is reported with success=true (even though it
> failed), explaining why no DisconnectStorageServer was issued (dev).

Comment 10 Allon Mureinik 2013-09-30 16:28:53 UTC
This is another aspect of the same root cause described in bug 991470.

Since this is a corner case only available via REST, proposing to postpone to 3.4.

Comment 12 Allon Mureinik 2014-02-16 08:50:39 UTC
Moved to ON_QA, since the root cause was solved by bug 991470.

Comment 13 Leonid Natapov 2014-02-18 11:24:38 UTC
3.4.0-0.7.beta2.el6. No error when removing a live snapshot from a stopped VM.

Comment 15 Itamar Heim 2014-06-12 14:07:44 UTC
Closing as part of 3.4.0