Created attachment 819598 [details]
vdsm and engine logs

Description of problem:
The storage domain link does not exist under /rhev/data-center/SPUUID, but vdsm reports that the domain status is 'active' via getStoragePoolInfo (the problematic SD is c5c9d15d-e375-4040-ba86-7a36d15e3072):

Thread-40544::INFO::2013-11-05 11:08:02,586::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 1, 'master_uuid': 'fdc629ee-f335-49f0-ba18-627298358070', 'name': 'iscsi2', 'version': '3', 'domains': u'fdc629ee-f335-49f0-ba18-627298358070:Active,c5c9d15d-e375-4040-ba86-7a36d15e3072

Under /rhev/data-center/:

[root@nott-vds1 ~]# ll /rhev/data-center/e37919d1-5622-4186-8860-2db87106a70f/
total 8
lrwxrwxrwx. 1 vdsm kvm 66 Nov  4 17:01 fdc629ee-f335-49f0-ba18-627298358070 -> /rhev/data-center/mnt/blockSD/fdc629ee-f335-49f0-ba18-627298358070
lrwxrwxrwx. 1 vdsm kvm 66 Nov  4 17:01 mastersd -> /rhev/data-center/mnt/blockSD/fdc629ee-f335-49f0-ba18-627298358070

^^ link to domain c5c9d15d-e375-4040-ba86-7a36d15e3072 does not exist ^^

The engine reports that the domain is in 'Active' status. When I try to create a new disk under this domain, it fails with:

7648b234-d07d-496a-83fd-a8e5852b75eb::ERROR::2013-11-05 09:42:22,623::task::850::TaskManager.Task::(_setError) Task=`7648b234-d07d-496a-83fd-a8e5852b75eb`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1905, in createVolume
    desc=desc, srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/sd.py", line 411, in createVolume
    preallocate, diskType, volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 406, in create
    imgPath = image.Image(repoPath).create(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/image.py", line 128, in create
    os.mkdir(imageDir)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/e37919d1-5622-4186-8860-2db87106a70f/c5c9d15d-e375-4040-ba86-7a36d15e3072/images/d6221bbc-ee6a-4558-8e87-1d4081559560'

Version-Release number of selected component (if applicable):
vdsm-4.13.0-0.5.beta1.el6ev.x86_64

How reproducible:
100% reproducible once we reach a situation in which the domain's link is missing.

Steps to Reproduce:
Happened to me on a block pool (iscsi) with 1 host connected. Requires a scenario in which the domain's link is missing; I'm not sure how I got there, it's still under investigation.

Actual results:
vdsm reports that the domain is in 'active' status, even though the domain's link is missing from the /rhev/data-center/SPUUID/ directory. This causes the engine to report that the domain is active, and we fail to perform actions on this domain.

Expected results:
When a storage domain link is missing from the /rhev/data-center/SPUUID/ directory, vdsm should not report that the domain is in 'active' status.

Additional info:
logs
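For illustration, here is a minimal Python sketch of the failure mode, reusing the paths from the traceback above purely as examples (it is not vdsm code): os.mkdir() requires every parent path component to resolve, so once the <pool-uuid>/<sd-uuid> symlink is gone, image-directory creation fails with ENOENT exactly as seen in the log.

import errno
import os

POOL = "/rhev/data-center/e37919d1-5622-4186-8860-2db87106a70f"
SD = "c5c9d15d-e375-4040-ba86-7a36d15e3072"
IMAGE = "d6221bbc-ee6a-4558-8e87-1d4081559560"

image_dir = os.path.join(POOL, SD, "images", IMAGE)

try:
    os.mkdir(image_dir)
except OSError as e:
    if e.errno == errno.ENOENT:
        # A parent path component is missing -- this is the
        # "[Errno 2] No such file or directory" raised by
        # image.py's create().
        print("SD link missing or broken: %s" % os.path.join(POOL, SD))
    else:
        raise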
Elad, any update on how you reached this situation to begin with?
(In reply to Ayal Baron from comment #1)
> Elad, any update on how you reached this situation to begin with?

I still don't know, but I think it is not relevant. IMO, VDSM should take the existence of the SD link under /rhev/data-center/ into consideration when it determines the domain status.
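To make the suggestion concrete, a hypothetical helper sketching the kind of check vdsm could fold into its status reporting; the function name and signature below are illustrative, not part of vdsm's API:

import os

def sd_link_exists(sp_uuid, sd_uuid, root="/rhev/data-center"):
    # os.path.lexists() returns True even for a dangling symlink, so
    # this specifically distinguishes "link missing" (the bug here)
    # from "link present but its target is unreachable".
    return os.path.lexists(os.path.join(root, sp_uuid, sd_uuid))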
Created attachment 832561 [details]
logs for missing links under /rhev/data-center

I've hit a situation in which the link of an SD is missing under /rhev/data-center/. Attaching the logs.
Retargeting considering the apparent ease of reproduction.
Created attachment 832639 [details]
engine logs for missing links under /rhev/data-center
Answering Ayal's question from https://bugzilla.redhat.com/show_bug.cgi?id=986652 comment 10:

(In reply to Ayal Baron from comment #10)
> (In reply to Elad from comment #9)
> > (In reply to Ayal Baron from comment #8)
> > > Elad, reproduction steps?
> >
> > It happened when I blocked connection to storage server which one of the
> > domains in the pool is located on (not the master)
> while it was not active, right?
> and then you activated it at some point? and it did not recreate the links?

The domain was active. The last log of the domain status before the connectivity loss (the relevant domain is 3bbf19c4-3a0d-4a91-883a-9824245659ee):

2013-12-04 11:45:55,330 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-21) Storage Domain 3bbf19c4-3a0d-4a91-883a-9824245659ee:iscsi5-3 was reported by Host nott-vds1 as Active in Pool c9fcc2ba-c091-4e29-861e-db132ea6e4b8, moving to active status
..........
The first log entry pointing to a problem with the domain:

2013-12-04 11:46:25,589 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) domain 3bbf19c4-3a0d-4a91-883a-9824245659ee:iscsi5-3 in problem. vds: nott-vds1
..........
2013-12-04 11:51:25,642 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-49) Domain 3bbf19c4-3a0d-4a91-883a-9824245659ee:iscsi5-3 was reported by all hosts in status UP as problematic. Moving the domain to NonOperational.
..........
After I resumed connectivity with the storage, storage domain activation succeeds:

2013-12-04 11:52:39,012 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-44) [1033bdd6] START, ActivateStorageDomainVDSCommand( storagePoolId = c9fcc2ba-c091-4e29-861e-db132ea6e4b8, ignoreFailoverLimit = false, storageDomainId = 3bbf19c4-3a0d-4a91-883a-9824245659ee), log id: 7f9ce10b
2013-12-04 11:52:43,316 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-44) [1033bdd6] FINISH, ActivateStorageDomainVDSCommand, log id: 7f9ce10b
........
The domain is now active; trying to add a disk located on the domain:

2013-12-04 11:53:15,563 INFO  [org.ovirt.engine.core.bll.AddDiskCommand] (ajp-/127.0.0.1:8702-5) [a95b472] Running command: AddDiskCommand internal: false. Entities affected : ID: 3bbf19c4-3a0d-4a91-883a-9824245659ee Type: Storage
........
The add disk fails:

2013-12-04 11:53:24,166 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-21) Error code GeneralException and error message VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = [Errno 2] No such file or directory: '/rhev/data-center/c9fcc2ba-c091-4e29-861e-db132ea6e4b8/3bbf19c4-3a0d-4a91-883a-9824245659ee/images/bb56f21e-ca43-4109-b1b5-d3e649a955ab'
I am still investigating, but from the VDSM logs it looks like the activation of storage domain '3bbf19c4-3a0d-4a91-883a-9824245659ee' actually failed at 2013-12-04 11:52.

Start by following this thread (activateSD):

Thread-628::DEBUG::2013-12-04 11:52:08,802::task::974::TaskManager.Task::(_decref) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::ref 1 aborting False
Thread-628::INFO::2013-12-04 11:52:08,803::sp::1110::Storage.StoragePool::(activateSD) sdUUID=3bbf19c4-3a0d-4a91-883a-9824245659ee spUUID=c9fcc2ba-c091-4e29-861e-db132ea6e4b8

Finally we get this:

Thread-628::DEBUG::2013-12-04 11:52:11,203::lvm::309::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/1elad1.12113859197|/dev/mapper/1elad1.12213859197|/dev/mapper/1elad1113738004|/dev/mapper/1elad1213738004|/dev/mapper/1elad1313738004|/dev/mapper/1elad1413738004|/dev/mapper/1elad1513738004|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 3bbf19c4-3a0d-4a91-883a-9824245659ee' (cwd None)
Thread-170::DEBUG::2013-12-04 11:52:11,240::iscsiadm::91::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-170::DEBUG::2013-12-04 11:52:11,241::misc::819::SamplingMethod::(__call__) Returning last result
Thread-628::DEBUG::2013-12-04 11:52:11,436::lvm::309::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12213859197 was disabled\n /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12113859197 was disabled\n Volume group "3bbf19c4-3a0d-4a91-883a-9824245659ee" not found\n Skipping volume group 3bbf19c4-3a0d-4a91-883a-9824245659ee\n'; <rc> = 5
Thread-628::WARNING::2013-12-04 11:52:11,439::lvm::391::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212189184: Input/output error', ' /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212246528: Input/output error', ' /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 0: Input/output error', ' WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12213859197 was disabled', ' /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212189184: Input/output error', ' /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212246528: Input/output error', ' /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 0: Input/output error', ' WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12113859197 was disabled', ' Volume group "3bbf19c4-3a0d-4a91-883a-9824245659ee" not found', ' Skipping volume group 3bbf19c4-3a0d-4a91-883a-9824245659ee']
Thread-628::DEBUG::2013-12-04 11:52:11,439::lvm::428::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-628::ERROR::2013-12-04 11:52:11,455::sdc::143::Storage.StorageDomainCache::(_findDomain) domain 3bbf19c4-3a0d-4a91-883a-9824245659ee not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('3bbf19c4-3a0d-4a91-883a-9824245659ee',)
Thread-628::ERROR::2013-12-04 11:52:11,455::task::850::TaskManager.Task::(_setError) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1237, in activateStorageDomain
    pool.activateSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1112, in activateSD
    dom = sdCache.produce(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('3bbf19c4-3a0d-4a91-883a-9824245659ee',)
Thread-628::DEBUG::2013-12-04 11:52:11,456::task::869::TaskManager.Task::(_run) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::Task._run: bb144a1d-262f-4139-8e8f-46ef7907a32a ('3bbf19c4-3a0d-4a91-883a-9824245659ee', 'c9fcc2ba-c091-4e29-861e-db132ea6e4b8') {} failed - stopping task
Thread-628::DEBUG::2013-12-04 11:52:11,456::task::1194::TaskManager.Task::(stop) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::stopping in state preparing (force False)
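For anyone following the code path in the traceback, a simplified, illustrative model of the lookup cascade in sdc.py (not vdsm's actual code): the cache tries each storage type's find method in turn, and when none can produce the domain (here, because `lvm vgs` returned rc=5 and reported the VG as not found), StorageDomainDoesNotExist propagates up through produce() and aborts activateSD.

class StorageDomainDoesNotExist(Exception):
    pass

def find_block_domain(sd_uuid):
    # Stand-in for the LVM-backed lookup that failed in the log above.
    raise StorageDomainDoesNotExist(sd_uuid)

def find_domain(sd_uuid, find_methods=(find_block_domain,)):
    # Try each find method; if all fail, the exception reaches the
    # caller (activateSD in the traceback), failing the task.
    for find in find_methods:
        try:
            return find(sd_uuid)
        except StorageDomainDoesNotExist:
            continue
    raise StorageDomainDoesNotExist(sd_uuid)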
*** Bug 986652 has been marked as a duplicate of this bug. ***
*** Bug 1026268 has been marked as a duplicate of this bug. ***
Created attachment 857548 [details]
engine and vdsm logs (30.1.14)

Reproduced again on is29, attaching the relevant logs.
Moving to 3.3.2 since 3.3.1 was built and moved to QE. Please make sure to clone the bug if you want it to hit a z-stream release.
*** Bug 949248 has been marked as a duplicate of this bug. ***
Storage domain links under /rhev/data-center/<pool-uuid>/ are restored once the host is activated from the non-operational state.

Verification steps:
1) Created a new shared DC with 2 hosts, created the master domain; one host took SPM.
2) Created a second domain located on a different storage server.
3) Blocked connectivity from the HSM host to the non-master domain; the host became non-operational, while both domains remained 'active'. While the non-master domain was unreachable to the HSM host and the host was non-operational, the links of the 2nd domain under /rhev/data-center/<pool-uuid>/ disappeared.
4) Resumed connectivity from the HSM host to the non-master domain's storage server; the host became 'up' and the links were created again:

Thread-37706::INFO::2014-02-26 12:15:14,034::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_4/beeac7e6-643e-4708-b5f8-ae2c00f1d791 to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/beeac7e6-643e-4708-b5f8-ae2c00f1d791
Thread-37706::INFO::2014-02-26 12:15:14,035::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/10.35.160.108:_RHEV_elad_9/f459e923-be62-475d-90a1-00d68dfa500d to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/f459e923-be62-475d-90a1-00d68dfa500d
Thread-37706::INFO::2014-02-26 12:15:14,035::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_3/c595a297-7575-4950-94a2-ff51b77c53ad to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/c595a297-7575-4950-94a2-ff51b77c53ad
Thread-37706::INFO::2014-02-26 12:15:14,035::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_3/c595a297-7575-4950-94a2-ff51b77c53ad to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/mastersd

Verified with ovirt-engine-3.4-beta3:
vdsm-4.14.3-0.el6.x86_64
ovirt-engine-3.4.0-0.11.beta3.el6.noarch
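For context, a minimal sketch of what the re-linking step logged above presumably does; the helper below mirrors the log's _linkStorageDomain name, but its body is an illustrative reconstruction, not vdsm's actual implementation:

import os

def link_storage_domain(src, link_name):
    # Create the /rhev/data-center/<pool-uuid>/<sd-uuid> symlink,
    # replacing a stale one if present. Building the link under a
    # temporary name and rename()-ing it into place is atomic on
    # POSIX, so there is no window where the link is missing.
    tmp = link_name + ".tmp"
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(src, tmp)
    os.rename(tmp, link_name)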
Compared with an older version of vdsm (vdsm-4.13.2-0.5.el6ev.x86_64): there, a non-operational host doesn't recreate the links under /rhev/data-center/SPUUID/ during activation, even when it has full connectivity to the storage server.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html