Bug 1013338
Summary: Error when removing snapshot - Storage domain layout corrupted if domain is mounted by two different paths (e.g. FQDN and IP)

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.3.0
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Raz Tamir <ratamir>
Assignee: Allon Mureinik <amureini>
QA Contact: Elad <ebenahar>
CC: acanan, acathrow, amureini, bazulay, ebenahar, iheim, lnatapov, lpeer, nsoffer, ratamir, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Doc Type: Bug Fix
Story Points: ---
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Bug Blocks: 1078909, 1142926
(In reply to ratamir from comment #0)
> Created attachment 804671 [details]
> engine + vdsm logs
>
> Description of problem:
> When trying to delete a snapshot that was created when the vm running (live
> snapshot),

Can we see the creation of the snapshot in the logs?

> the nfs domain is somehow mounted twice on the host - once through fqdn once
> through ip:
>
> [root@gold-vdsc ~]# ll /rhev/data-center/mnt/
> total 16
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 10.35.64.106:_fastpass_ratamir_nfs1
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 10.35.64.106:_fastpass_ratamir_nfs2
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2
>
> so it seems vdsm sees the same
> SDUUID (SD=d2d701ea-a185-483c-af8e-e71753dbb0cc) directory under 2 different
> paths:
>
> [root@gold-vdsc ~]# tree /rhev/data-center/mnt/ | grep d2d
> |   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc
> |   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc

Can you check how the nfs domain is mounted before and after the creation of the snapshot?

> Expected results:
> The snapshot should be removed

Can you tell on which version this operation did succeed?

Please attach all vdsm logs (the attached 2 logs do not contain all the info necessary).

Please attach /proc/mounts as well.

The problem is, as you stated, that the domain is mounted twice; we need to figure out how you reached this state.

Created attachment 804719 [details]
all_vdsm_logs

(In reply to Ayal Baron from comment #2)
> Please attach all vdsm logs (the attached 2 logs do not contain all the info
> necessary).

all vdsm logs added

> Please attach /proc/mounts as well.
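The duplicated listing quoted above is the crux of the report: the same NFS export appears once under its FQDN and once under its IP, and each mount gets its own directory under /rhev/data-center/mnt. As an illustration only (a hypothetical standalone helper, not part of vdsm or the engine), duplicate mounts of a single export can be spotted by grouping `mount` output on the resolved `addr=` option plus export path, instead of on the literal server name:

```python
# Hypothetical helper (not vdsm code): flag NFS exports that are mounted
# more than once under different server names (e.g. FQDN vs. IP).
import re
from collections import defaultdict

# Matches lines in `mount` output such as:
#   server:/export on /mnt/point type nfs (rw,...,addr=10.35.64.106)
MOUNT_RE = re.compile(
    r'^(?P<spec>\S+?):(?P<export>/\S+) on (?P<mnt>\S+) type nfs\d? '
    r'\((?P<opts>[^)]*)\)$'
)

def duplicate_nfs_exports(mount_output):
    """Return {(server_addr, export): [mount points]} for exports
    mounted more than once. Keying on the addr= option (the resolved
    server IP) makes FQDN and IP mounts of the same export collapse
    into a single key."""
    groups = defaultdict(list)
    for line in mount_output.splitlines():
        m = MOUNT_RE.match(line.strip())
        if not m:
            continue  # not an NFS mount line
        opts = dict(
            opt.split('=', 1) if '=' in opt else (opt, True)
            for opt in m.group('opts').split(',')
        )
        key = (opts.get('addr', m.group('spec')), m.group('export'))
        groups[key].append(m.group('mnt'))
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Run against the `mount` output attached later in this bug, this would report `/fastpass/ratamir/nfs1` and `/fastpass/ratamir/nfs2` each mounted at two different mount points for the same server address.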
[root@gold-vdsc ~]# mount
/dev/mapper/vg0-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
tiger.qa.lab.tlv.redhat.com:/fastpass/ratamir/nfs1 on /rhev/data-center/mnt/tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)
tiger.qa.lab.tlv.redhat.com:/fastpass/ratamir/nfs2 on /rhev/data-center/mnt/tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)
10.35.64.106:/fastpass/ratamir/nfs1 on /rhev/data-center/mnt/10.35.64.106:_fastpass_ratamir_nfs1 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)
10.35.64.106:/fastpass/ratamir/nfs2 on /rhev/data-center/mnt/10.35.64.106:_fastpass_ratamir_nfs2 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)

> The problem is, as you stated, that the domain is mounted twice; we need to
> figure out how you reached this state.

These logs still do not contain the info of when these domains were mounted, unfortunately.

Haim, why is this a regression? Can you reproduce?

From the engine log:

2013-09-24 17:38:27,685 - Attempt to create the domain via IP mount (failed):

2013-09-24 17:38:27,685 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-10) Command CreateStorageDomainVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to CreateStorageDomainVDS, error = Storage domain is not empty - requires cleaning: ('10.35.64.106:/fastpass/ratamir/nfs1',)
2013-09-24 17:38:27,685 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-10) FINISH, CreateStorageDomainVDSCommand, log id: 9e368ee
2013-09-24 17:38:27,686 ERROR [org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand] (ajp-/127.0.0.1:8702-10) Command org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to CreateStorageDomainVDS, error = Storage domain is not empty - requires cleaning: ('10.35.64.106:/fastpass/ratamir/nfs1',) (Failed with error StorageDomainNotEmpty and code 361)

There are several instances of creating a storage domain (either successful or failing) in the log, both from IP and from FQDN. When the creation fails, there is no DisconnectStorageServerCommand upon failure, resulting in the same path being mounted twice.

Action items: need to try to reproduce this scenario, and see:
1. Does this happen consistently (QA)
2. Check if the command is reported with success=true (even though it failed), explaining why no DisconnectStorageServer was issued (dev).

In any event, resetting the component, since VDSM can't run a command that's never sent to it :-)

(In reply to Allon Mureinik from comment #7)
> Action items: need to try to reproduce this scenario, and see:
> 1. Does this happen consistently (QA)

- From GUI the mount is cleaned.
- From REST it does not clean after failed creation.

> 2. Check if the command is reported with success=true (even though it
> failed), explaining why no DisconnectStorageServer was issued (dev).

This is another aspect of the same root cause described in bug 991470. Since this is a corner case only available via REST, proposing to postpone to 3.4.

Moved to ON_QA, since the root cause was solved by bug 991470.

3.4.0-0.7.beta2.el6. No error when removing live snapshot from stopped VM.

Closing as part of 3.4.0
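The analysis above pins the leak on a missing cleanup path: when CreateStorageDomainVDS fails, no DisconnectStorageServerCommand is issued, so the mount created for the attempt is left behind, and a retry via a different address (FQDN vs. IP) mounts the same export a second time. The general shape of the fix is the usual connect/try/disconnect-on-failure pattern, sketched here in Python with hypothetical names (the actual engine code is Java and was addressed via bug 991470; `connect_storage_server` etc. are illustrative stand-ins for the engine's verbs, not real APIs):

```python
# Illustrative sketch only. The backend method names are hypothetical
# stand-ins for the engine's ConnectStorageServer / CreateStorageDomain /
# DisconnectStorageServer verbs.
class StorageDomainCreationError(Exception):
    pass

def add_storage_domain(backend, connection):
    backend.connect_storage_server(connection)   # mounts the export
    try:
        return backend.create_storage_domain(connection)
    except StorageDomainCreationError:
        # Without this cleanup the mount is left behind; a later attempt
        # via a different address (FQDN vs. IP) then mounts the same
        # export a second time - the state seen in this bug.
        backend.disconnect_storage_server(connection)
        raise
```

The key property is that the disconnect runs on the failure path and the original error still propagates to the caller.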
Created attachment 804671 [details]
engine + vdsm logs

Description of problem:
When trying to delete a snapshot that was created while the VM was running (live snapshot), an error message appears: "Error while executing action Remove Snapshot: Storage domain layout corrupted".

From the vdsm log, the problem occurs during "deleteImage":

Thread-321928::DEBUG::2013-09-29 11:41:12,621::task::974::TaskManager.Task::(_decref) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::ref 1 aborting False
Thread-321928::ERROR::2013-09-29 11:41:12,634::task::850::TaskManager.Task::(_setError) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1543, in deleteImage
    dom.deleteImage(sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 343, in deleteImage
    currImgDir = getImagePath(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 98, in getImagePath
    return os.path.join(getDomPath(sdUUID), 'images', imgUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 92, in getDomPath
    raise se.StorageDomainLayoutError(sdUUID)
StorageDomainLayoutError: Storage domain layout corrupted: ('d2d701ea-a185-483c-af8e-e71753dbb0cc',)
Thread-321928::DEBUG::2013-09-29 11:41:12,635::task::869::TaskManager.Task::(_run) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::Task._run: 73c7cb7d-425b-4245-b283-5d8efe3f290f ('d2d701ea-a185-483c-af8e-e71753dbb0cc', '0fc8637c-c5a9-463d-81d7-42266087900a', '6f151c8d-740c-459c-8c6d-61fb0b2273ed', 'false', 'false') {} failed - stopping task

The nfs domain is somehow mounted twice on the host - once through fqdn, once through ip:

[root@gold-vdsc ~]# ll /rhev/data-center/mnt/
total 16
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 10.35.64.106:_fastpass_ratamir_nfs1
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 10.35.64.106:_fastpass_ratamir_nfs2
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2

So it seems vdsm sees the same SDUUID (SD=d2d701ea-a185-483c-af8e-e71753dbb0cc) directory under 2 different paths:

[root@gold-vdsc ~]# tree /rhev/data-center/mnt/ | grep d2d
|   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc
|   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc

Version-Release number of selected component (if applicable):
vdsm-4.12.0-138.gitab256be.el6ev.x86_64
rhevm-3.3.0-0.22.master.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
Setup: 1 NFS DC, 1 host, 2 storage domains.
1. Create a new VM.
2. Create a live snapshot of that VM.
3. When the snapshot status is "OK", "Power Off" the VM.
4. Delete the snapshot.

Actual results:
Error message appears: "Error while executing action Remove Snapshot: Storage domain layout corrupted", and the snapshot status becomes "BROKEN" on the webadmin portal.

Expected results:
The snapshot should be removed.

Additional info:
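The traceback shows why the double mount surfaces as "Storage domain layout corrupted": vdsm's fileSD.py resolves a domain UUID to its on-disk path by searching under the mount root, and that lookup assumes exactly one match. A simplified sketch of that kind of lookup (modeled on the traceback above, not copied from vdsm) makes the failure mode concrete:

```python
# Simplified sketch of a getDomPath-style lookup (illustrative, not the
# actual vdsm implementation): a storage domain UUID must resolve to
# exactly one directory under the mount root.
import glob
import os

class StorageDomainLayoutError(Exception):
    pass

def get_dom_path(mnt_root, sd_uuid):
    matches = glob.glob(os.path.join(mnt_root, '*', sd_uuid))
    if len(matches) != 1:
        # Zero matches: the domain is not mounted at all.
        # Two matches: the same export is mounted twice (e.g. once by
        # FQDN, once by IP) - exactly the state reported in this bug.
        raise StorageDomainLayoutError(sd_uuid)
    return matches[0]
```

With the domain mounted under both 10.35.64.106:_fastpass_ratamir_nfs1 and tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1, the glob finds the UUID twice and the lookup raises, which deleteImage then reports as a corrupted layout.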