Created attachment 625004 [details]
vdsm + engine logs

Description of problem:
In a setup with 2 NFS storage domains and 1 ISO domain (all on different servers), blocking the master domain leads to a reconstruct flow. The reconstructMaster command takes longer than 3 minutes to return when the master domain is still blocked.

Version-Release number of selected component (if applicable):
vdsm-4.9.6-37.0.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a setup with more than 1 NFS storage domain (on different servers) and a single host
2. Block the connection between the host and the master storage domain
3. Check the vdsm logs for the running times of reconstructMaster

Actual results:

Expected results:

Additional info:
The logs show significant time gaps between operations in the reconstruct command, and it is unclear exactly what is taking so long during them. There is a 2-minute gap in Thread-500's logging between 17:09:34,189 and 17:11:34,581, and later a 3-minute gap between 17:11:34,791 and 17:14:35,219.

Thread-500::DEBUG::2012-10-10 17:09:34,171::sp::1537::Storage.StoragePool::(getMasterDomain) Master domain 5efd1c4f-a4cb-4808-9190-65b98435385a verified, version 6
Thread-500::DEBUG::2012-10-10 17:09:34,172::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.sp.updateMonitoringThreads)
Thread-500::DEBUG::2012-10-10 17:09:34,172::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-500::DEBUG::2012-10-10 17:09:34,175::persistentDict::226::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Data', 'DESCRIPTION=Domain1', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIM
Thread-500::DEBUG::2012-10-10 17:09:34,177::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `9a3d9ecd-855b-4e51-b85b-0a604de4e
Thread-500::DEBUG::2012-10-10 17:09:34,178::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `5efd1c4f-a4cb-4808-9190-65b984353
Thread-500::DEBUG::2012-10-10 17:09:34,181::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `f4e791b5-83af-4b65-96c7-6e30138fe
Thread-500::DEBUG::2012-10-10 17:09:34,189::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-500::DEBUG::2012-10-10 17:11:34,587::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write
Thread-500::DEBUG::2012-10-10 17:11:34,786::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found\n Volume group "f4e791b5-83af-4b65-96c7-6e3
Thread-500::WARNING::2012-10-10 17:11:34,790::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found', ' Volume group "f4e791b5-83af-4b65-96c7-6
Thread-500::DEBUG::2012-10-10 17:11:34,791::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-500::INFO::2012-10-10 17:14:35,219::sp::1132::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/orion.qa.lab.tlv.redhat.com:_export_gickowic_nfs1/5efd1c4f-a4cb-4808-9190-65b98435385a t
Thread-781::DEBUG::2012-10-10 17:14:55,382::BindingXMLRPC::171::vds::(wrapper) [10.35.97.5]
Thread-781::DEBUG::2012-10-10 17:14:55,383::task::588::TaskManager.Task::(_updateState) Task=`9fe9b01a-ea70-4709-8ae4-eeb3d7a48e13`::moving from state init -> state preparing
Thread-781::INFO::2012-10-10 17:14:55,383::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='92093b80-07da-489b-b31c-78f9234665a1', poolName='TestDataCenter', masterDom='5efd1c4f-a4cb-
Thread-781::INFO::2012-10-10 17:14:55,383::hsm::1507::Storage.HSM::(reconstructMaster) spUUID=92093b80-07da-489b-b31c-78f9234665a1 master=5efd1c4f-a4cb-4808-9190-65b98435385a
Thread-500::INFO::2012-10-10 17:15:35,414::sp::1132::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/orion.qa.lab.tlv.redhat.com:_export_gickowic_nfs1/5efd1c4f-a4cb-4808-9190-65b98435385a t
Thread-781::INFO::2012-10-10 17:15:35,415::fileSD::227::Storage.StorageDomain::(validate) sdUUID=5efd1c4f-a4cb-4808-9190-65b98435385a
Thread-781::DEBUG::2012-10-10 17:15:35,431::persistentDict::226::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=['CLASS=Data', 'DESCRIPTION=Domain1', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIM
Thread-781::DEBUG::2012-10-10 17:15:35,432::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.92093b80-07da-489b-b31c-78f9234665a1`ReqID=`14409933-25da-4d9d-a144-afa0ea8b31f7`::Request was
Thread-781::DEBUG::2012-10-10 17:15:35,433::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.92093b80-07da-489b-b31c-78f9234665a1' for lock type 'exclusive'
Thread-781::DEBUG::2012-10-10 17:15:35,433::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.92093b80-07da-489b-b31c-78f9234665a1' is currently locked, Entering queue (4 in queue)
Thread-781::DEBUG::2012-10-10 17:17:35,434::resourceManager::186::ResourceManager.Request::(cancel) ResName=`Storage.92093b80-07da-489b-b31c-78f9234665a1`ReqID=`14409933-25da-4d9d-a144-afa0ea8b31f7`::Canceled requ
Thread-781::DEBUG::2012-10-10 17:17:35,438::resourceManager::705::ResourceManager.Owner::(acquire) 9fe9b01a-ea70-4709-8ae4-eeb3d7a48e13: request for 'Storage.92093b80-07da-489b-b31c-78f9234665a1' timed out after '
Thread-781::ERROR::2012-10-10 17:17:35,439::task::853::TaskManager.Task::(_setError) Task=`9fe9b01a-ea70-4709-8ae4-eeb3d7a48e13`::Unexpected error
The ~two minutes:

Thread-500::DEBUG::2012-10-10 17:09:34,189::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex

Thread-500::DEBUG::2012-10-10 17:09:34,177::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `9a3d9ecd-855b-4e51-b85b-0a604de4e5f6`
Thread-500::DEBUG::2012-10-10 17:09:34,178::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `5efd1c4f-a4cb-4808-9190-65b98435385a`
Thread-500::DEBUG::2012-10-10 17:09:34,181::sp::1571::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `92093b80-07da-489b-b31c-78f9234665a1` started monitoring domain `f4e791b5-83af-4b65-96c7-6e30138feb77`
Thread-500::DEBUG::2012-10-10 17:09:34,189::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-500::DEBUG::2012-10-10 17:11:34,587::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 9a3d9ecd-855b-4e51-b85b-0a604de4e5f6 f4e791b5-83af-4b65-96c7-6e30138feb77' (cwd None)
Thread-500::DEBUG::2012-10-10 17:11:34,786::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found\n Volume group "f4e791b5-83af-4b65-96c7-6e30138feb77" not found\n'; <rc> = 5
Thread-500::WARNING::2012-10-10 17:11:34,790::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "9a3d9ecd-855b-4e51-b85b-0a604de4e5f6" not found', ' Volume group "f4e791b5-83af-4b65-96c7-6e30138feb77" not found']
Thread-500::DEBUG::2012-10-10 17:11:34,791::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex

Thread-500::DEBUG::2012-10-10 17:09:33,974::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-500::DEBUG::2012-10-10 17:09:34,144::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-544::DEBUG::2012-10-10 17:09:34,178::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-544::DEBUG::2012-10-10 17:09:34,349::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-546::DEBUG::2012-10-10 17:10:34,367::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-546::DEBUG::2012-10-10 17:10:34,538::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-500::DEBUG::2012-10-10 17:11:34,581::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-544::DEBUG::2012-10-10 17:11:34,586::lvm::352::OperationMutex::(_reloadvgs) Got the operation mutex
Thread-500::DEBUG::2012-10-10 17:11:34,791::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
We can see from the log that during those two minutes the operation mutex is taken and released by other threads.
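For reference, gaps like the ones quoted above can be located mechanically rather than by eyeballing the log. The following is a rough sketch (not part of vdsm; the line/timestamp format is assumed from the excerpts above) that reports every per-thread silence longer than a threshold:

```python
import re
from datetime import datetime, timedelta

# Assumed vdsm log line shape: "<thread>::<LEVEL>::<YYYY-MM-DD HH:MM:SS,mmm>::..."
TS_RE = re.compile(r"^(?P<thread>[^:]+)::\w+::(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def find_gaps(lines, threshold=timedelta(minutes=1)):
    """Return (thread, start, end) for every silence longer than `threshold`
    between consecutive log lines of the same thread."""
    last_seen = {}
    gaps = []
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # continuation lines (tracebacks etc.) carry no timestamp
        thread = m.group("thread")
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        prev = last_seen.get(thread)
        if prev is not None and ts - prev > threshold:
            gaps.append((thread, prev, ts))
        last_seen[thread] = ts
    return gaps
```

Run over the full vdsm.log (e.g. `find_gaps(open("vdsm.log"))`), this would flag Thread-500 between 17:09:34,189 and 17:11:34,581 and again between 17:11:34,791 and 17:14:35,219.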
*** Bug 870756 has been marked as a duplicate of this bug. ***
According to Sergey and Ayal, this fix won't be part of RHEL 6.5 - nothing to do then.
Sean, this too depends on getting rid of the master domain work.
Liron, now that the pool metadata is memory only, what's left to do?
Allon, having the pool metadata in memory should speed things up, but it won't solve everything completely, as we still access the storage and that can always be problematic. We can check how relevant this bug still is and, if it is, look into improving and optimizing the current reconstruct flow. If it's an issue that happens only relatively rarely, I assume it will be solved by the DC-without-SPM work.
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015. Please review this bug and, if it is not a blocker, please postpone it to a later release.

All bugs not postponed on GA release will be automatically re-targeted to
- 3.6.1 if severity >= high
- 4.0 if severity < high
nijin, I've inspected the case logs. While the issue of long-lasting connectStoragePool does appear there, I think it's preferred to handle the issue in a different BZ:

1. From the logs it appears that vdsm is flooded with getVolumeSize() calls. Were they executed by the user? Why?

2. vdsm gets timeouts when collecting the vm stats, which leads the engine to move the vms to Non Responding. For some reason it seems like we have only 10 minutes of log for each day (without the relevant time), so I can't get further information. Nir, have you encountered that issue before?

VM Channels Listener::DEBUG::2015-11-06 10:05:45,601::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 15.
VM Channels Listener::DEBUG::2015-11-06 10:05:45,603::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 148.
VM Channels Listener::DEBUG::2015-11-06 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 157.
VM Channels Listener::DEBUG::2015-11-06 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 35.

3. For some reason, repoStats() isn't logged as called by the engine (as part of getVdsStats) for a few minutes; when it is, the reported values are very high. That seems related to 1+2.

Nir, does anything from the above ring a bell?

nijin, let's continue the handling of this issue in a separate bug with the proper logs.

thanks,
Liron.
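As an aside, flooding like the getVolumeSize() storm mentioned above can be quantified by bucketing log hits per minute. A quick sketch (not a vdsm tool; the timestamp format is assumed from the excerpts in this bug):

```python
import re
from collections import Counter

# Assumed vdsm log timestamps: "YYYY-MM-DD HH:MM:SS,mmm" somewhere in the line.
MINUTE_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}")

def calls_per_minute(lines, needle):
    """Count log lines containing `needle`, bucketed by minute."""
    counts = Counter()
    for line in lines:
        if needle in line:
            m = MINUTE_RE.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts
```

For example, `calls_per_minute(open("vdsm.log"), "getVolumeSize")` returning hundreds of hits in a single minute would confirm the flood.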
(In reply to Liron Aravot from comment #22)
> for some reason it seems like we have only 10 minutes of log of each day
> (without the relevant time) - so i can't seem to be able to get further
> information. Nir, have you encountered that issue before?

No

> VM Channels Listener::DEBUG::2015-11-06 10:05:45,601::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 15.
> VM Channels Listener::DEBUG::2015-11-06 10:05:45,603::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 148.
> VM Channels Listener::DEBUG::2015-11-06 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 157.
> VM Channels Listener::DEBUG::2015-11-06 10:05:45,604::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 35.

This means the guest agent is not responding; ask Vinzenz for help on this.

> 3. for some reason, repoStats() isn't logged as called by the engine (as
> part of getVdsStats) for few minutes, when it does the reported values are
> very high. That seems as related to 1+2.
>
> Nir, does anything from the above rings a bell?

Yes, it reminds me of another bug (I don't remember the number) where the virt guys took a lock on the vdsm connection, delaying repoStats calls for several minutes. This should be solved in 3.6. Michal can help with this.
Sorry, forgot to add - if the issue reproduces, we can try to enable lvm logging on that machine; currently it's disabled.

nijin - could you attach the /var/log/messages file from that date? It's currently missing.

And a further question - nijin, if the issue does reproduce 100%, is it always on the same host?

thanks,
Liron.
Added the update in bug #1293991, as I have opened a new bug for this.
Created attachment 1136159 [details]
vdsm.log, messages and engine.log

Not sure if this issue still occurs. An attempt to reproduce (2 data NFS domains residing on different storage servers) resulted in the following:

After blocking the master domain:

Thread-50::ERROR::2016-03-14 14:49:53,952::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 95ed6a3c-07fb-4846-a16a-36ae568969db
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 644, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 274, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 454, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 421, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

storageRefresh::DEBUG::2016-03-14 14:51:07,976::lvm::497::Storage.OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,976::lvm::499::Storage.OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,977::lvm::508::Storage.OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,977::lvm::510::Storage.OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,978::lvm::528::Storage.OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,978::lvm::530::Storage.OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:07,978::misc::760::Storage.SamplingMethod::(__call__) Returning last result
storageRefresh::DEBUG::2016-03-14 14:51:07,979::lvm::319::Storage.OperationMutex::(_reloadpvs) Operation 'lvm reload operation' got the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:08,162::lvm::347::Storage.OperationMutex::(_reloadpvs) Operation 'lvm reload operation' released the operation mutex
storageRefresh::DEBUG::2016-03-14 14:51:08,162::lvm::370::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
jsonrpc.Executor/0::INFO::2016-03-14 14:54:30,402::logUtils::48::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID=u'00000001-0001-0001-0001-0000000000c8', poolName=u'Default', masterDom=u'7323a71f-0038-4f9b-90f4-1cecc427859c', domDict={u'7323a71f-0038-4f9b-90f4-1cecc427859c': u'active', u'fe703b42-0858-4b4b-815e-4b0f1edf956f': u'attached', u'377fffda-340e-45dc-9574-f9ff99b55a7e': u'attached', u'5fa224b8-ab80-4b97-89c9-0b13da1a3879': u'attached', u'95ed6a3c-07fb-4846-a16a-36ae568969db': u'active'}, masterVersion=6, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, hostId=1, options=None)
jsonrpc.Executor/0::INFO::2016-03-14 14:55:31,104::clusterlock::325::Storage.SANLock::(release) Releasing cluster lock for domain 7323a71f-0038-4f9b-90f4-1cecc427859c
jsonrpc.Executor/0::DEBUG::2016-03-14 14:55:31,107::clusterlock::335::Storage.SANLock::(release) Cluster lock for domain 7323a71f-0038-4f9b-90f4-1cecc427859c successfully released
jsonrpc.Executor/0::INFO::2016-03-14 14:55:31,107::logUtils::51::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None

============================================================================

reconstructMaster itself took ~1 minute to complete (14:54:30,402 to 14:55:31,107).
As it seems to me, this bug can be closed. Is there any further issue we'd like to solve here?
(In reply to Liron Aravot from comment #29)
> As it seems to me, this bug can be closed. Is there any further issue we'd
> like to solve here?

Nope, closing.