Bug 1026697 - [vdsm] storage domain is reported as 'active' by host, even though its link under /rhev/data-center/SPUUID/ is missing
Summary: [vdsm] storage domain is reported as 'active' by host, even though its link u...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Sergey Gotliv
QA Contact: Elad
URL:
Whiteboard: storage
Duplicates: 949248 986652 (view as bug list)
Depends On:
Blocks: 1066409 rhev3.4beta 1142926
 
Reported: 2013-11-05 09:29 UTC by Elad
Modified: 2018-12-06 15:29 UTC (History)
13 users

Fixed In Version: v4.14.0
Doc Type: Bug Fix
Doc Text:
Previously, VDSM would report that a storage domain was active even though a link for that storage domain was not present in the /rhev/data-center/SPUUID directory. This would occur when the host was in a non-operational state and was able to connect to the storage pool but was not able to connect to any storage domains. Now, the logic for refreshing storage domain links has been revised so that those links will be created correctly for storage domains even under these circumstances.
Clone Of:
: 1066409 (view as bug list)
Environment:
Last Closed: 2014-06-09 13:26:22 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
vdsm and engine logs (6.32 MB, application/x-gzip)
2013-11-05 09:29 UTC, Elad
no flags Details
logs for missing links under /rhev/data-center (1.59 MB, application/x-gzip)
2013-12-04 10:36 UTC, Elad
no flags Details
engine logs for missing links under /rhev/data-center (7.92 MB, text/plain)
2013-12-04 13:05 UTC, Elad
no flags Details
engine and vdsm logs (30.1.14) (1.53 MB, application/x-gzip)
2014-01-30 15:42 UTC, Elad
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0504 0 normal SHIPPED_LIVE vdsm 3.4.0 bug fix and enhancement update 2014-06-09 17:21:35 UTC
oVirt gerrit 22467 0 'None' MERGED hsm: refresh pool connection on connectStoragePool 2020-12-23 09:47:04 UTC

Description Elad 2013-11-05 09:29:33 UTC
Created attachment 819598 [details]
vdsm and engine logs

Description of problem:
The storage domain's link does not exist under /rhev/data-center/SPUUID, but vdsm reports the domain's status as 'active' via getStoragePoolInfo:

(The problematic SD is c5c9d15d-e375-4040-ba86-7a36d15e3072)

Thread-40544::INFO::2013-11-05 11:08:02,586::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 1, 'master_uuid': 'fdc629ee-f335-49f0-ba18-627298358070', 'name': 'iscsi2', 'version': '3', 'domains': u'fdc629ee-f335-49f0-ba18-627298358070:Active,c5c9d15d-e375-4040-ba86-7a36d15e3072


Under /rhev/data-center/ :

[root@nott-vds1 ~]# ll /rhev/data-center/e37919d1-5622-4186-8860-2db87106a70f/
total 8
lrwxrwxrwx. 1 vdsm kvm 66 Nov  4 17:01 fdc629ee-f335-49f0-ba18-627298358070 -> /rhev/data-center/mnt/blockSD/fdc629ee-f335-49f0-ba18-627298358070
lrwxrwxrwx. 1 vdsm kvm 66 Nov  4 17:01 mastersd -> /rhev/data-center/mnt/blockSD/fdc629ee-f335-49f0-ba18-627298358070

^^ link to domain c5c9d15d-e375-4040-ba86-7a36d15e3072 does not exist ^^
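For illustration, a minimal diagnostic sketch (Python, not part of VDSM) that cross-checks the domains getStoragePoolInfo reports as Active against the pool-level symlinks; the UUIDs are taken from this report:

# Diagnostic sketch only: list domains that vdsm reports as Active but that
# have no symlink under /rhev/data-center/<SPUUID>/.
import os

SP_UUID = "e37919d1-5622-4186-8860-2db87106a70f"   # pool UUID from this report
reported_active = [
    "fdc629ee-f335-49f0-ba18-627298358070",        # master domain, link present
    "c5c9d15d-e375-4040-ba86-7a36d15e3072",        # reported Active, link missing
]

pool_dir = os.path.join("/rhev/data-center", SP_UUID)
for sd_uuid in reported_active:
    link = os.path.join(pool_dir, sd_uuid)
    if not os.path.islink(link):
        print("domain %s is reported Active but %s is missing" % (sd_uuid, link))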

The engine reports that the domain is in 'Active' status. When I try to create a new disk on this domain, it fails with:

7648b234-d07d-496a-83fd-a8e5852b75eb::ERROR::2013-11-05 09:42:22,623::task::850::TaskManager.Task::(_setError) Task=`7648b234-d07d-496a-83fd-a8e5852b75eb`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1905, in createVolume
    desc=desc, srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/sd.py", line 411, in createVolume
    preallocate, diskType, volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 406, in create
    imgPath = image.Image(repoPath).create(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/image.py", line 128, in create
    os.mkdir(imageDir)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/e37919d1-5622-4186-8860-2db87106a70f/c5c9d15d-e375-4040-ba86-7a36d15e3072/images/d6221bbc-ee6a-4558-8e87-1d4081559560'
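The OSError above is simply how os.mkdir() behaves when the parent path does not exist, which is the case once the pool-level symlink for the domain is gone. A standalone illustration (made-up paths, not from the report):

# os.mkdir() raises ENOENT when the parent directory is missing, mirroring
# the createVolume failure above. Paths here are made up for the example.
import errno
import os
import tempfile

base = tempfile.mkdtemp()
image_dir = os.path.join(base, "no-such-domain", "images", "some-image-uuid")
try:
    os.mkdir(image_dir)
except OSError as e:
    assert e.errno == errno.ENOENT
    print("mkdir failed as in the report: %s" % e)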


Version-Release number of selected component (if applicable):
vdsm-4.13.0-0.5.beta1.el6ev.x86_64

How reproducible:
100% reproducible once we get into a situation in which the domain's link is missing.

Steps to Reproduce:
Happened to me on a block pool (iSCSI) with 1 host connected. Requires a scenario in which the domain's link is missing; I'm not sure how I got into it, it's still under investigation.


Actual results:
vdsm reports that the domain is in 'active' status, even though the domain's link is missing from the /rhev/data-center/SPUUID/ directory. This causes the engine to report that the domain is active, and actions on this domain fail.

Expected results:
When a storage domain's link is missing from the /rhev/data-center/SPUUID/ directory, vdsm should not report that the domain is in 'active' status.

Additional info:
logs

Comment 1 Ayal Baron 2013-11-11 14:56:40 UTC
Elad, any update on how you reached this situation to begin with?

Comment 2 Elad 2013-11-11 16:55:00 UTC
(In reply to Ayal Baron from comment #1)
> Elad, any update on how you reached this situation to begin with?

I still don't know.
But I think it is not relevant; IMO, VDSM should take into consideration the existence of the SD link under /rhev/data-center/ when it determines the domain status.
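A hedged sketch of the kind of check being suggested (a hypothetical helper, not actual VDSM code): treat a domain as healthy only if its pool-level link exists and resolves.

# Hypothetical helper, not actual VDSM code: status logic could consult this
# in addition to the domain monitor's result before reporting a domain Active.
import os

def domain_link_ok(sp_uuid, sd_uuid, root="/rhev/data-center"):
    """Return True only if the pool-level symlink exists and resolves."""
    link = os.path.join(root, sp_uuid, sd_uuid)
    return os.path.islink(link) and os.path.exists(link)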

Comment 5 Elad 2013-12-04 10:36:53 UTC
Created attachment 832561 [details]
logs for missing links under /rhev/data-center

I've hit a situation in which the link for an SD is missing under /rhev/data-center/.

Attaching the logs

Comment 6 Ayal Baron 2013-12-04 13:04:24 UTC
Retargeting, considering the apparent ease of reproduction.

Comment 7 Elad 2013-12-04 13:05:50 UTC
Created attachment 832639 [details]
engine logs for missing links under /rhev/data-center

Comment 8 Elad 2013-12-05 08:25:49 UTC
Answering Ayal's question from https://bugzilla.redhat.com/show_bug.cgi?id=986652, comment 10:

(In reply to Ayal Baron from comment #10)
> (In reply to Elad from comment #9)
> > (In reply to Ayal Baron from comment #8)
> > > Elad, reproduction steps?
> > 
> > It happened when I blocked connection to storage server which one of the
> > domains in the pool is located on (not the master)
> while it was not active, right?
> and then you activated it at some point? and it did not recreate the links?

The domain was active. The last log line showing the domain's status before the connectivity loss (the relevant domain is 3bbf19c4-3a0d-4a91-883a-9824245659ee):

2013-12-04 11:45:55,330 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-21) Storage Domain 3bbf19c4-3a0d-4a91-883a-9824245659ee:iscsi5-3 was reported by Host nott-vds1 as Active in Pool c9fcc2ba-c091-4e29-861e-db132ea6e4b8, moving to active status

..........
The first log line pointing to a problem with the domain:

2013-12-04 11:46:25,589 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) domain 3bbf19c4-3a0d-4a91-883a-9824245659ee:iscsi5-3 in problem. vds: nott-vds1

..........

2013-12-04 11:51:25,642 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-49) Domain 3bbf19c4-3a0d-4a91-883a-9824245659ee:iscsi5-3 was reported by all hosts in status UP as problematic. Moving the domain to NonOperational.


..........
After I resumed connectivity to the storage, storage domain activation succeeded:

2013-12-04 11:52:39,012 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-44) [1033bdd6] START, ActivateStorageDomainVDSCommand( storagePoolId = c9fcc2ba-c091-4e29-861e-db132ea6e4b8, ignoreFailoverLimit = false, storageDomainId = 3bbf19c4-3a0d-4a91-883a-9824245659ee), log id: 7f9ce10b
2013-12-04 11:52:43,316 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-44) [1033bdd6] FINISH, ActivateStorageDomainVDSCommand, log id: 7f9ce10b

........

The domain is now active; trying to add a disk located on the domain:

2013-12-04 11:53:15,563 INFO  [org.ovirt.engine.core.bll.AddDiskCommand] (ajp-/127.0.0.1:8702-5) [a95b472] Running command: AddDiskCommand internal: false. Entities affected :  ID: 3bbf19c4-3a0d-4a91-883a-9824245659ee Type: Storage


........
The add disk fails:

2013-12-04 11:53:24,166 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-21) Error code GeneralException and error message VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = [Errno 2] No such file or directory: '/rhev/data-center/c9fcc2ba-c091-4e29-861e-db132ea6e4b8/3bbf19c4-3a0d-4a91-883a-9824245659ee/images/bb56f21e-ca43-4109-b1b5-d3e649a955ab'

Comment 9 Sergey Gotliv 2013-12-05 09:00:16 UTC
I am still investigating, but from the VDSM logs it looks like the activation of Storage Domain '3bbf19c4-3a0d-4a91-883a-9824245659ee' actually failed at 2013-12-04 11:52.

Start by following this thread (activateSD):

Thread-628::DEBUG::2013-12-04 11:52:08,802::task::974::TaskManager.Task::(_decref) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::ref 1 aborting False
Thread-628::INFO::2013-12-04 11:52:08,803::sp::1110::Storage.StoragePool::(activateSD) sdUUID=3bbf19c4-3a0d-4a91-883a-9824245659ee spUUID=c9fcc2ba-c091-4e29-861e-db132ea6e4b8

Finally, we get this:

Thread-628::DEBUG::2013-12-04 11:52:11,203::lvm::309::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/1elad1.12113859197|/dev/mapper/1elad1.12213859197|/dev/mapper/1elad1113738004|/dev/mapper/1elad1213738004|/dev/mapper/1elad1313738004|/dev/mapper/1elad1413738004|/dev/mapper/1elad1513738004|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 3bbf19c4-3a0d-4a91-883a-9824245659ee' (cwd None)
Thread-170::DEBUG::2013-12-04 11:52:11,240::iscsiadm::91::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-170::DEBUG::2013-12-04 11:52:11,241::misc::819::SamplingMethod::(__call__) Returning last result
Thread-628::DEBUG::2013-12-04 11:52:11,436::lvm::309::Storage.Misc.excCmd::(cmd) FAILED: <err> = '  /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12213859197 was disabled\n  /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12113859197 was disabled\n  Volume group "3bbf19c4-3a0d-4a91-883a-9824245659ee" not found\n  Skipping volume group 3bbf19c4-3a0d-4a91-883a-9824245659ee\n'; <rc> = 5
Thread-628::WARNING::2013-12-04 11:52:11,439::lvm::391::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212189184: Input/output error', '  /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 32212246528: Input/output error', '  /dev/mapper/1elad1.12213859197: read failed after 0 of 4096 at 0: Input/output error', '  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12213859197 was disabled', '  /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212189184: Input/output error', '  /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 32212246528: Input/output error', '  /dev/mapper/1elad1.12113859197: read failed after 0 of 4096 at 0: Input/output error', '  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1elad1.12113859197 was disabled', '  Volume group "3bbf19c4-3a0d-4a91-883a-9824245659ee" not found', '  Skipping volume group 3bbf19c4-3a0d-4a91-883a-9824245659ee']
Thread-628::DEBUG::2013-12-04 11:52:11,439::lvm::428::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-628::ERROR::2013-12-04 11:52:11,455::sdc::143::Storage.StorageDomainCache::(_findDomain) domain 3bbf19c4-3a0d-4a91-883a-9824245659ee not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('3bbf19c4-3a0d-4a91-883a-9824245659ee',)
Thread-628::ERROR::2013-12-04 11:52:11,455::task::850::TaskManager.Task::(_setError) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1237, in activateStorageDomain
    pool.activateSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1112, in activateSD
    dom = sdCache.produce(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('3bbf19c4-3a0d-4a91-883a-9824245659ee',)
Thread-628::DEBUG::2013-12-04 11:52:11,456::task::869::TaskManager.Task::(_run) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::Task._run: bb144a1d-262f-4139-8e8f-46ef7907a32a ('3bbf19c4-3a0d-4a91-883a-9824245659ee', 'c9fcc2ba-c091-4e29-861e-db132ea6e4b8') {} failed - stopping task
Thread-628::DEBUG::2013-12-04 11:52:11,456::task::1194::TaskManager.Task::(stop) Task=`bb144a1d-262f-4139-8e8f-46ef7907a32a`::stopping in state preparing (force False)

Comment 10 Sergey Gotliv 2013-12-11 20:14:28 UTC
*** Bug 986652 has been marked as a duplicate of this bug. ***

Comment 12 Sergey Gotliv 2013-12-17 15:10:48 UTC
*** Bug 1026268 has been marked as a duplicate of this bug. ***

Comment 13 Elad 2014-01-30 15:42:34 UTC
Created attachment 857548 [details]
engine and vdsm logs (30.1.14)

Reproduced again on is29, attaching the relevant logs

Comment 14 Eyal Edri 2014-02-10 10:29:08 UTC
Moving to 3.3.2 since 3.3.1 was built and moved to QE.
Please make sure to clone the bug if you want to hit the z-stream release.

Comment 15 Tal Nisan 2014-02-13 13:56:03 UTC
*** Bug 949248 has been marked as a duplicate of this bug. ***

Comment 20 Elad 2014-02-26 10:31:22 UTC
The storage domain's links under /rhev/data-center/<pool-uuid>/ are restored once the host is activated from the non-operational state.

Verification steps:
1) Created a new shared DC with 2 hosts, created the master domain; 1 host took SPM
2) Created a second domain located on a different storage server
3) Blocked connectivity from the HSM host to the non-master domain; the host became non-operational, and both domains remained 'active'. While one of the pool's domains was unreachable from the HSM host and the host was non-operational, the 2nd domain's link under /rhev/data-center/<pool-uuid>/ disappeared.
4) Resumed connectivity from the HSM host to the non-master domain's storage server; the host became 'up' and the links were created again:

Thread-37706::INFO::2014-02-26 12:15:14,034::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_4/beeac7e6-643e-4708-b5f8-ae2c00f1d791 to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/beeac7e6-643e-4708-b5f8-ae2c00f1d791
Thread-37706::INFO::2014-02-26 12:15:14,035::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/10.35.160.108:_RHEV_elad_9/f459e923-be62-475d-90a1-00d68dfa500d to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/f459e923-be62-475d-90a1-00d68dfa500d
Thread-37706::INFO::2014-02-26 12:15:14,035::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_3/c595a297-7575-4950-94a2-ff51b77c53ad to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/c595a297-7575-4950-94a2-ff51b77c53ad
Thread-37706::INFO::2014-02-26 12:15:14,035::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_3/c595a297-7575-4950-94a2-ff51b77c53ad to /rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/mastersd
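Conceptually, each _linkStorageDomain line above boils down to (re)creating a pool-level symlink; a rough approximation (not the actual VDSM implementation), using one source/target pair from the log:

# Rough approximation of the link creation logged above (not VDSM's code):
# build the symlink under a temporary name, then rename it into place so a
# stale link is replaced atomically.
import os

def link_storage_domain(src, link_name):
    tmp = link_name + ".tmp"
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(src, tmp)
    os.rename(tmp, link_name)   # atomic on POSIX, overwrites an existing link

link_storage_domain(
    "/rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_4/"
    "beeac7e6-643e-4708-b5f8-ae2c00f1d791",
    "/rhev/data-center/2b812207-5ac8-4927-9839-9706b7021296/"
    "beeac7e6-643e-4708-b5f8-ae2c00f1d791",
)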



Verified with ovirt-engine-3.4-beta3:
vdsm-4.14.3-0.el6.x86_64
ovirt-engine-3.4.0-0.11.beta3.el6.noarch

Comment 21 Elad 2014-02-27 07:50:48 UTC
Compared with an older version of vdsm (vdsm-4.13.2-0.5.el6ev.x86_64): the non-operational host doesn't create the links under /rhev/data-center/spuuid/ during activation, even when it has full connectivity to the storage server.

Comment 22 errata-xmlrpc 2014-06-09 13:26:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html

