Bug 865034 - VDSM reconstructMaster takes longer than 3 minutes to return when a storage domain is blocked
Summary: VDSM reconstructMaster takes longer than 3 minutes to return when a storage domain is blocked
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.0
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 870756
Depends On: 1058022
Blocks:
 
Reported: 2012-10-10 16:33 UTC by Gadi Ickowicz
Modified: 2019-11-30 09:14 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-23 13:09:49 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
scohen: needinfo+
scohen: Triaged+


Attachments
vdsm + engine logs (45.78 MB, application/x-gzip)
2012-10-10 16:33 UTC, Gadi Ickowicz
vdsm.log, messages and engine.log (1016.16 KB, application/x-gzip)
2016-03-14 13:58 UTC, Elad

Description Gadi Ickowicz 2012-10-10 16:33:08 UTC
Created attachment 625004 [details]
vdsm + engine logs

Description of problem:
In a setup with 2 NFS storage domains and 1 ISO domain (all on different servers) blocking the master domain leads to a reconstruct flow.

The reconstructMaster command takes longer than 3 minutes to return when the master domain is still blocked.

Version-Release number of selected component (if applicable):
vdsm-4.9.6-37.0.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a setup with more than 1 NFS storage domain (on different servers) and a single host
2. Block the connection between the host and the master storage domain (a sketch of one way to do this is included under Additional info below)
3. Check the vdsm logs for the running times of reconstructMaster
  
Actual results:


Expected results:


Additional info:
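For reference, a minimal sketch of one way to perform step 2 above from the host, assuming the master domain is an NFS export and using iptables to drop traffic to it. The server address below is a hypothetical placeholder, not taken from this setup:

# Hedged sketch: block/unblock traffic from the host to the master-domain NFS server
# so that the engine triggers reconstructMaster, whose timing can then be read from
# /var/log/vdsm/vdsm.log (step 3). Run as root; requires iptables.
import subprocess
import time

MASTER_NFS_SERVER = "nfs-master.example.com"  # hypothetical placeholder address

def block(addr):
    # Insert a DROP rule for all outgoing packets to the storage server.
    subprocess.check_call(["iptables", "-I", "OUTPUT", "-d", addr, "-j", "DROP"])

def unblock(addr):
    # Remove the matching DROP rule to restore connectivity.
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", addr, "-j", "DROP"])

if __name__ == "__main__":
    block(MASTER_NFS_SERVER)
    try:
        # Keep the master domain blocked long enough for the reconstruct flow to start.
        time.sleep(600)
    finally:
        unblock(MASTER_NFS_SERVER)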

Comment 2 Gadi Ickowicz 2012-10-11 09:10:03 UTC
The logs show significant time gaps between operations in the reconstruct flow, and it is unclear exactly what is taking so long during those gaps. There is a 2-minute gap in Thread-500's logging between 17:09:34,189 and 17:11:34,581, and later a 3-minute gap between 17:11:34,791 and 17:14:35,219.


Thread-500::DEBUG::2012-10-10 17:09:34,171::sp::1537::Storage.StoragePool::(getMasterDomain) Master domain 5efd1c4f-a4cb-4808-9190-65b98435385a verified, version 6
Thread-500::DEBUG::2012-10-10 17:09:34,172::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.sp.updateMonitoringThreads)
Thread-500::DEBUG::2012-10-10 17:09:34,172::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-500::DEBUG::2012-10-10 17:09:34,175::persistentDict::226::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Data', 'DESCRIPTION=Domain1', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIM
Thread-500::DEBUG::2012-10-10 17:09:34,177::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `9a3d9ecd-855b-4e51-b85b-0a604de4e
Thread-500::DEBUG::2012-10-10 17:09:34,178::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `5efd1c4f-a4cb-4808-9190-65b984353
Thread-500::DEBUG::2012-10-10 17:09:34,181::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `f4e791b5-83af-4b65-96c7-6e30138fe
Thread-500::DEBUG::2012-10-10 17:09:34,189::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-500::DEBUG::2012-10-10 17:11:34,587::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write
Thread-500::DEBUG::2012-10-10 17:11:34,786::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = '  Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found\n  Volume group "f4e791b5-83af-4b65-96c7-6e3
Thread-500::WARNING::2012-10-10 17:11:34,790::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found', '  Volume group "f4e791b5-83af-4b65-96c7-6
Thread-500::DEBUG::2012-10-10 17:11:34,791::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-500::INFO::2012-10-10 17:14:35,219::sp::1132::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/orion.qa.lab.tlv.redhat.com:_export_gickowic_nfs1/5efd1c4f-a4cb-4808-9190-65b98435385a t
Thread-781::DEBUG::2012-10-10 17:14:55,382::BindingXMLRPC::171::vds::(wrapper) [10.35.97.5]
Thread-781::DEBUG::2012-10-10 17:14:55,383::task::588::TaskManager.Task::(_updateState) Task=`9fe9b01a-ea70-4709-8ae4-eeb3d7a48e13`::moving from state init -> state preparing
Thread-781::INFO::2012-10-10 17:14:55,383::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='92093b80-07da-489b-b31c-78f9234665a1', poolName='TestDataCenter', masterDom='5efd1c4f-a4cb-
Thread-781::INFO::2012-10-10 17:14:55,383::hsm::1507::Storage.HSM::(reconstructMaster) spUUID=92093b80-07da-489b-b31c-78f9234665a1 master=5efd1c4f-a4cb-4808-9190-65b98435385a
Thread-500::INFO::2012-10-10 17:15:35,414::sp::1132::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/orion.qa.lab.tlv.redhat.com:_export_gickowic_nfs1/5efd1c4f-a4cb-4808-9190-65b98435385a t
Thread-781::INFO::2012-10-10 17:15:35,415::fileSD::227::Storage.StorageDomain::(validate) sdUUID=5efd1c4f-a4cb-4808-9190-65b98435385a
Thread-781::DEBUG::2012-10-10 17:15:35,431::persistentDict::226::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Data', 'DESCRIPTION=Domain1', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIM
Thread-781::DEBUG::2012-10-10 17:15:35,432::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.92093b80-07da-489b-b31c-78f9234665a1`ReqID=`14409933-25da-4d9d-a144-afa0ea8b31f7`::Request was
Thread-781::DEBUG::2012-10-10 17:15:35,433::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.92093b80-07da-489b-b31c-78f9234665a1' for lock type 'exclusive'
Thread-781::DEBUG::2012-10-10 17:15:35,433::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.92093b80-07da-489b-b31c-78f9234665a1' is currently locked, Entering queue (4 in queue)
Thread-781::DEBUG::2012-10-10 17:17:35,434::resourceManager::186::ResourceManager.Request::(cancel) ResName=`Storage.92093b80-07da-489b-b31c-78f9234665a1`ReqID=`14409933-25da-4d9d-a144-afa0ea8b31f7`::Canceled requ
Thread-781::DEBUG::2012-10-10 17:17:35,438::resourceManager::705::ResourceManager.Owner::(acquire) 9fe9b01a-ea70-4709-8ae4-eeb3d7a48e13: request for 'Storage.92093b80-07da-489b-b31c-78f9234665a1' timed out after '
Thread-781::ERROR::2012-10-10 17:17:35,439::task::853::TaskManager.Task::(_setError) Task=`9fe9b01a-ea70-4709-8ae4-eeb3d7a48e13`::Unexpected error
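
For reference, gaps like the two quoted above can be located mechanically. A minimal sketch, assuming only the Thread-N::LEVEL::YYYY-MM-DD HH:MM:SS,mmm::... prefix format visible in these excerpts; the 60-second threshold is an arbitrary illustration value:

# Hedged sketch: report large gaps between consecutive vdsm.log lines of the same thread.
import re
import sys
from datetime import datetime

LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::\w+::(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::")

def find_gaps(path, threshold=60.0):
    last = {}  # thread name -> timestamp of its previous log line
    for line in open(path):
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        thread = m.group("thread")
        if thread in last:
            gap = (ts - last[thread]).total_seconds()
            if gap >= threshold:
                print("%s: %.1fs gap ending at %s" % (thread, gap, m.group("ts")))
        last[thread] = ts

if __name__ == "__main__":
    find_gaps(sys.argv[1])

Running something like this over the attached vdsm.log should point at the same two Thread-500 gaps without manual scanning.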

Comment 5 Eduardo Warszawski 2012-10-30 15:06:45 UTC
The ~two minutes:

Thread-500::DEBUG::2012-10-10 17:09:34,189::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex


Thread-500::DEBUG::2012-10-10 17:09:34,177::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `9a3d9ecd-855b-4e51-b85b-0a604de4e5f6`
Thread-500::DEBUG::2012-10-10 17:09:34,178::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `5efd1c4f-a4cb-4808-9190-65b98435385a`
Thread-500::DEBUG::2012-10-10 17:09:34,181::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `f4e791b5-83af-4b65-96c7-6e30138feb77`
Thread-500::DEBUG::2012-10-10 17:09:34,189::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-500::DEBUG::2012-10-10 17:11:34,587::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 9a3d9ecd-855b-4e51-b85b-0a604de4e5f6 f4e791b5-83af-4b65-96c7-6e30138feb77' (cwd None)
Thread-500::DEBUG::2012-10-10 17:11:34,786::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = '  Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found\n  Volume group "f4e791b5-83af-4b65-96c7-6e30138feb77" not found\n'; <rc> = 5
Thread-500::WARNING::2012-10-10 17:11:34,790::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found', '  Volume group "f4e791b5-83af-4b65-96c7-6e30138feb77" not found']
Thread-500::DEBUG::2012-10-10 17:11:34,791::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex





Thread-500::DEBUG::2012-10-10 17:09:33,974::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-500::DEBUG::2012-10-10 17:09:34,144::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-544::DEBUG::2012-10-10 17:09:34,178::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-544::DEBUG::2012-10-10 17:09:34,349::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-546::DEBUG::2012-10-10 17:10:34,367::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-546::DEBUG::2012-10-10 17:10:34,538::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-544::DEBUG::2012-10-10 17:11:34,586::lvm::352::OperationMutex::(_reloadvgs) Got the operational mutex
Thread-500::DEBUG::2012-10-10 17:11:34,791::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex

Comment 6 Eduardo Warszawski 2012-10-31 09:17:49 UTC
The log shows that during those two minutes the operation mutex is taken and released by other threads.
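
For what it's worth, that observation can be double-checked mechanically. A minimal sketch, assuming only the "got/released the operation mutex" wording visible in the excerpts above; it lists which thread held the 'lvm reload operation' mutex and for how long:

# Hedged sketch: compute per-thread hold times of the "lvm reload operation" mutex
# from the "got the operation mutex" / "released the operation mutex" vdsm.log lines.
import re
import sys
from datetime import datetime

EVENT_RE = re.compile(
    r"^(?P<thread>[^:]+)::\w+::(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r".*'lvm reload operation' (?P<event>got|released) the operation mutex")

def mutex_holds(path):
    acquired = {}  # thread -> time it last acquired the mutex
    for line in open(path):
        m = EVENT_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        thread = m.group("thread")
        if m.group("event") == "got":
            acquired[thread] = ts
        elif thread in acquired:
            held = (ts - acquired.pop(thread)).total_seconds()
            print("%s held the mutex for %.3fs, released at %s"
                  % (thread, held, m.group("ts")))

if __name__ == "__main__":
    mutex_holds(sys.argv[1])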

Comment 7 Eduardo Warszawski 2012-11-03 22:02:06 UTC
*** Bug 870756 has been marked as a duplicate of this bug. ***

Comment 12 Aharon Canan 2013-09-24 11:53:16 UTC
According to Sergey and Ayal,
this fix won't be part of RHEL 6.5, so there is nothing to do here for now.

Comment 13 Ayal Baron 2014-02-23 13:11:25 UTC
Sean, this too depends on the work to get rid of the master domain.

Comment 17 Allon Mureinik 2015-07-06 16:48:59 UTC
Liron, now that the pool metadata is memory only, what's left to do?

Comment 18 Liron Aravot 2015-08-20 13:45:28 UTC
Allon, having the pool metadata in memory should speed things up, but it won't solve the problem completely, since we still access the storage and that can always be problematic.
We should check how relevant this bug still is; if it is, we can look into improving and optimizing the current reconstruct flow. If it is an issue that occurs only relatively rarely, I assume it can be resolved as part of the DC-without-SPM work.

Comment 19 Sandro Bonazzola 2015-10-26 12:35:53 UTC
This is an automated message. oVirt 3.6.0 RC3 has been released, and GA is targeted for next week, Nov 4th 2015.
Please review this bug and, if it is not a blocker, postpone it to a later release.
All bugs not postponed by the GA release will be automatically re-targeted to:

- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 22 Liron Aravot 2015-12-13 14:06:28 UTC
nijin,
I've inspected the case logs. While the issue of long-lasting connectStoragePool does appear there, I think it's preferable to handle it in a different BZ:

1. From the logs it appears that vdsm is flooded with getVolumeSize() calls. Were they executed by the user? Why? (One way to quantify this is sketched below, after point 3.)

2. vdsm gets timeouts when collecting the VM stats, which leads the engine to move the VMs to Non Responding. For some reason it seems like we only have 10 minutes of log for each day (without the relevant time), so I can't get further information. Nir, have you encountered that issue before?

VM Channels Listener::DEBUG::2015-11-06 10:05:45,601::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 15.
VM Channels Listener::DEBUG::2015-11-06 10:05:45,603::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 148.
VM Channels Listener::DEBUG::2015-11-06 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 157.
VM Channels Listener::DEBUG::2015-11-06 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 35.

3. For some reason, repoStats() isn't logged as being called by the engine (as part of getVdsStats) for a few minutes; when it is, the reported values are very high. That seems related to 1+2.

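As a rough way of quantifying the flood mentioned in point 1, something like the sketch below could be run over the case's vdsm.log; matching on the plain "getVolumeSize" string (rather than any particular log format) is an assumption for illustration:

# Hedged sketch: count vdsm.log lines mentioning getVolumeSize, bucketed per minute,
# to see whether the call rate really spikes.
import re
import sys
from collections import Counter

MINUTE_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}")

def count_per_minute(path, needle="getVolumeSize"):
    counts = Counter()
    for line in open(path):
        if needle in line:
            m = MINUTE_RE.search(line)
            if m:
                counts[m.group(1)] += 1
    for minute, n in sorted(counts.items()):
        print("%s  %d matching lines" % (minute, n))

if __name__ == "__main__":
    count_per_minute(sys.argv[1])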

Nir, does anything from the above ring a bell?

nijin, let's continue handling this issue in a separate bug with the proper logs.

thanks,
Liron.

Comment 23 Nir Soffer 2015-12-13 14:28:09 UTC
(In reply to Liron Aravot from comment #22)
> For some reason it seems like we only have 10 minutes of log for each day
> (without the relevant time), so I can't get further information. Nir, have
> you encountered that issue before?

No
 
> VM Channels Listener::DEBUG::2015-11-06
> 10:05:45,601::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 15.
> VM Channels Listener::DEBUG::2015-11-06
> 10:05:45,603::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 148.
> VM Channels Listener::DEBUG::2015-11-06
> 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 157.
> VM Channels Listener::DEBUG::2015-11-06
> 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 35.

This means the guest agent is not responding; ask Vinzenz for help with this.

> 3. For some reason, repoStats() isn't logged as being called by the engine
> (as part of getVdsStats) for a few minutes; when it is, the reported values
> are very high. That seems related to 1+2.
> 
> Nir, does anything from the above ring a bell?

Yes, it reminds me of another bug (I don't remember the number) where the virt guys
took a lock on the vdsm connection, delaying repoStats calls for several minutes.
This should be solved in 3.6. Michal can help with this.

Comment 24 Liron Aravot 2016-01-07 13:39:12 UTC
Sorry, forgot to add -

If the issue reproduces, we can try to enable LVM logging on that machine; currently it's disabled.

nijin - could you attach the /var/log/messages file from that date? It's currently missing.

And a further question - nijin, if the issue does reproduce 100%, is it always on the same host?

thanks, Liron.

Comment 25 nijin ashok 2016-01-14 13:24:08 UTC
Added the update in bug #1293991, as I have opened a new bug for this.

Comment 28 Elad 2016-03-14 13:58:00 UTC
Created attachment 1136159 [details]
vdsm.log, messages and engine.log

Not sure if this issue still occurs.

An attempt to reproduce (2 data NFS domains residing on different storage servers) resulted in the following:

After blocking the master domain:


Thread-50::ERROR::2016-03-14 14:49:53,952::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 95ed6a3c-07fb-4846-a16a-36ae568969db
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 644, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 274, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 454, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 421, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out


storageRefresh::DEBUG::2016-03-14 14:51:07,976::lvm::497::Storage.OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,976::lvm::499::Storage.OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,977::lvm::508::Storage.OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,977::lvm::510::Storage.OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,978::lvm::528::Storage.OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,978::lvm::530::Storage.OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,978::misc::760::Storage.SamplingMethod::(__call__) Returning last result
storageRefresh::DEBUG::2016-03-14 14:51:07,979::lvm::319::Storage.OperationMutex::(_reloadpvs) Operation 'lvm reload operation' got the operation mutex


storageRefresh::DEBUG::2016-03-14 14:51:08,162::lvm::347::Storage.OperationMutex::(_reloadpvs) Operation 'lvm reload operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:08,162::lvm::370::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex




jsonrpc.Executor/0::INFO::2016-03-14 14:54:30,402::logUtils::48::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID=u'00000001-0001-0001-0001-0000000000c8', poolName=u'Default', masterDom=u'7323a71f-0038-4f9b-90f4-1cecc427859c', domDict={u'7323a71f-0038-4f9b-90f4-1cecc427859c': u'active', u'fe703b42-0858-4b4b-815e-4b0f1edf956f': u'attached', u'377fffda-340e-45dc-9574-f9ff99b55a7e': u'attached', u'5fa224b8-ab80-4b97-89c9-0b13da1a3879': u'attached', u'95ed6a3c-07fb-4846-a16a-36ae568969db': u'active'}, masterVersion=6, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, hostId=1, options=None)


jsonrpc.Executor/0::INFO::2016-03-14 14:55:31,104::clusterlock::325::Storage.SANLock::(release) Releasing cluster lock for domain 7323a71f-0038-4f9b-90f4-1cecc427859c
jsonrpc.Executor/0::DEBUG::2016-03-14 14:55:31,107::clusterlock::335::Storage.SANLock::(release) Cluster lock for domain 7323a71f-0038-4f9b-90f4-1cecc427859c successfully released
jsonrpc.Executor/0::INFO::2016-03-14 14:55:31,107::logUtils::51::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None

============================================================================
ReconstructMaster itself took ~1 minute to complete (14:54:30,402 to 14:55:31,107).

Comment 29 Liron Aravot 2016-03-23 12:34:12 UTC
As it seems to me, this bug can be closed. Is there any further issue we'd like to solve here?

Comment 30 Allon Mureinik 2016-03-23 13:09:49 UTC
(In reply to Liron Aravot from comment #29)
> As it seems to me, this bug can be closed. Is there any further issue we'd
> like to solve here?
Nope, closing.

