Bug 968349
| Summary: | [scale] [storage] ConnectStorageServer failed - The thread pool is out of limit (engine finish its thread pool) | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | vvyazmin <vvyazmin> |
| Component: | ovirt-engine | Assignee: | Allon Mureinik <amureini> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Eldad Marciano <emarcian> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.2.0 | CC: | acanan, amureini, bazulay, bsettle, gklein, juan.hernandez, lpeer, oourfali, rbalakri, Rhev-m-bugs, scohen, tnisan, yeylon, ylavi |
| Target Milestone: | ovirt-3.6.0-rc | Keywords: | ZStream |
| Target Release: | 3.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-3.6.0_qa1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1007393 1106572 1185633 (view as bug list) | Environment: | |
| Last Closed: | 2016-03-10 10:33:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1007393, 1176517 | | |
| Bug Blocks: | 1106572, 1185633 | | |
| Attachments: | | | |
```
2013-05-29 12:17:43,095 ERROR [org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation] (pool-4-thread-921) [5f248318] Failed to connect host Fake_Host_032 to storage pool 005_Fake_Host_DataCenter. Exception: {3}: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7f1a1a76 rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@1dfc445b[Running, pool size = 500, active threads = 500, queued tasks = 0, completed tasks = 326411]
    at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:54) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:29) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.connectStorageToDomainByVdsId(ISCSIStorageHelper.java:216) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation.execute(ConnectSingleAsyncOperation.java:18) [engine-bll.jar:]
    at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:42) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:31) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalCallable.call(ThreadPoolUtil.java:99) [engine-utils.jar:]
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_19]
    at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_19]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_19]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_19]
    at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_19]
```

The last thread dump is from 2013-05-29 at 16:15, but the log message is from 12:17, much earlier, and it contains only 217 threads, only 50 of which are from the exhausted pool. Can we reproduce this and get a thread dump exactly when the log message appears?

Looking at the engine.log.1 log file, I see that the operations to activate the first 5 storage domains were started, which produces 250 asynchronous invocations to VDSM, thus consuming 250 threads. In addition, those 250 operations are invoked by threads that run in the same pool and seem to wait for them to finish, which makes 500 threads in total, so there is no way we can handle this with the current size of the pool. I guess we would need at least 800 threads (50 hosts * 8 domains * 2 threads) in the pool to be able to perform this operation:

```
# psql -U engine
engine=> update vdc_options set option_value = '800' where option_name = 'DefaultMaxThreadPoolSize';
engine=> \q
# service ovirt-engine restart
```

But 800 threads is unreasonable in my opinion. I think that instead of using so many threads we should have a thread pool with fewer threads and a queue of tasks (currently we don't use a queue).

Juan - this was my assumption when the reporter came to me and talked about this. I agree we should consider using a queue.
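To make the failure mode in Juan's analysis above concrete - parent tasks and the per-host calls they wait on sharing one fixed-size pool with no queue - here is a minimal, self-contained sketch. It is not engine code: the class name is invented, and a pool size of 4 stands in for the engine's DefaultMaxThreadPoolSize of 500.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolStarvationDemo {
    public static void main(String[] args) throws Exception {
        int size = 4; // stand-in for the engine's DefaultMaxThreadPoolSize (500)
        // SynchronousQueue = no task queue: a submit() that finds no idle thread
        // and cannot grow the pool is rejected immediately.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                size, size, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<Runnable>());

        CyclicBarrier allRunning = new CyclicBarrier(size);
        List<Future<?>> outer = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            // Each "activation" task submits a nested "connect" task to the SAME
            // pool and waits for it, so two threads are tied up per operation.
            outer.add(pool.submit(() -> {
                allRunning.await();                       // every pool thread is now busy
                Future<?> inner = pool.submit(() -> { }); // no free thread, no queue -> rejected
                inner.get();
                return null;
            }));
        }
        for (Future<?> f : outer) {
            try {
                f.get();
            } catch (ExecutionException e) {
                // java.util.concurrent.RejectedExecutionException, as in the engine log
                System.out.println("nested submit rejected: " + e.getCause());
            }
        }
        pool.shutdown();
    }
}
```

Run as-is, every nested submit is rejected just like the ConnectStorageServerVDS calls in the log: each worker is occupied by a parent task that cannot finish until a child task it can no longer submit has run.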
Also, if you look at XmlRpcUtils you see we create the AsyncProxy for the call, so for a storage domain addition we use two threads (actually, at first I thought that the list at ISCSIStorageHelper would use more threads, but the code of XmlRpcUtils uses the Future object to limit the call with a timeout), so I assume this will happen with all storage domain types, and not just iSCSI. Am I correct here? If we can reproduce this problem, may I ask for a jstack dump (we need 3)?

Vlad, please see the question in comment 26 - why is this a test blocker? Does it block anything except scale testing? Thanks.

1. It blocks another BZ and Test Plans.
2. The thread pool is out of limit, and I failed to do other tasks in the engine.

1. This bug comes from the scenario of BZ 613180.
2. Today, I don't know of a BZ which is blocked by this issue.

Removing the blocker flag based on the removal of the TestBlocker and the comments above.

I think that what we should do here, to start with, is make both the number of threads and the size of the queue configurable. The number of threads is already configurable, using DefaultMaxThreadPoolSize. I think we should add a new configuration parameter to control the size of the queue. Let's say that the name is ThreadPoolQueueSize. The logic when creating the thread pool should be to create a bounded queue with that maximum size. This bounded queue can be a LinkedBlockingQueue or an ArrayBlockingQueue - not really much difference - but it should have a limited capacity in any case. The default size for this parameter should be 0, so that we don't modify the current behavior by default. These queue implementations may not support 0 as the size, so we may need to do something like this when creating the work queue for the thread pool:

```java
BlockingQueue<Runnable> workQueue = null;
int size = Config.<Integer> GetValue(ConfigValues.ThreadPoolQueueSize);
if (size <= 0) {
    workQueue = new SynchronousQueue<Runnable>();
} else {
    workQueue = new LinkedBlockingQueue<Runnable>(size);
}
```

These parameters should also be available for modification with the engine-config tool. Currently they can only be modified directly in the database. I think they should be available in the engine-config tool, so that GSS and users can modify them easily. The change to make this configurable should go upstream, in RHEV 3.3, and also in RHEV 3.2.z, if possible. Once this is configurable, we should then do another change to reduce the number of threads and enlarge the size of the queue by default. That change should go in 3.3 but not in 3.2.z.

Ayal, Barak, please see Juan's comment and the proposed solution (which should be done in several stages). The bug is targeted as urgent and 3.3 - do you confirm this solution for 3.3?

IMHO, this solution can help us overcome issues in the field thanks to the configuration, but maybe we should also revisit the flows themselves, if possible, and understand why we're using that amount of threads and whether it's possible to reduce them.
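A sketch of how the pool construction could look if Juan's proposal were adopted. The factory class, method, and parameter names here are hypothetical, not the actual ThreadPoolUtil code; a real implementation would read the values from ConfigValues as in the snippet above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical factory for the engine's internal pool - for illustration only.
public final class BoundedPoolFactory {

    public static ThreadPoolExecutor createPool(int maxThreads, int queueSize) {
        // queueSize <= 0 keeps today's behavior: a direct hand-off with no
        // queuing, so a submit() that finds no idle thread is rejected.
        BlockingQueue<Runnable> workQueue = (queueSize <= 0)
                ? new SynchronousQueue<Runnable>()
                : new LinkedBlockingQueue<Runnable>(queueSize);

        // With a bounded queue, a burst like "activate 8 domains on 50 hosts"
        // waits in the queue instead of being rejected, and far fewer threads
        // are needed to drain it.
        return new ThreadPoolExecutor(maxThreads, maxThreads,
                60L, TimeUnit.SECONDS, workQueue);
    }

    private BoundedPoolFactory() {
    }
}
```

Note that a queue alone does not remove the nested wait in the connect flow: if parent tasks still block on children submitted to the same pool, a full pool plus a full queue can stall instead of rejecting, which is part of why the comments below also push for batching the connect calls and making the broker asynchronous.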
There are multiple things we can do here:

Specifically for the connect issue:
1. Call a single connectStorageServer for all connections required of the same type (connectStorageServer supports this already).
2. Move to the manageConnections API (async connect, and the host will automatically connect whenever disconnected).

However, this is a temporary relief (which should still be done) that would not solve the general issue here, which is that in a scale environment we will need to change the basic way we work. If we have >500 hosts then we would hit this issue regardless of the changes above.

What needs to be done is change the way vdsBroker works to be totally async; then we can reduce the number of threads required to 1 or 2 (even for thousands of hosts, assuming the engine can handle the processing of so many requests and responses). (A sketch of what such an async fan-out could look like follows this exchange.)

(In reply to Ayal Baron from comment #36)
> There are multiple things we can do here:
>
> Specifically for the connect issue:
> 1. call a single connectStorageServer for all connections required of the
> same type (connectStorageServer supports this already)

This is irrelevant for the scenario discussed here as long as the design remains as is. Consider manually going over all the SDs and activating them. Each one sends its own set of connectStorageServer calls - where would you do the grouping? What we could do is create a new command, ActivateMultipleStorageDomains, to handle this, but there is still nothing to block a user from activating these domains one by one.

> 2. move to manageConnections API (async connect and host will automatically
> connect whenever disconnected).

+3!

> However, this is a temporary relief (which should still be done) that would
> not solve the general issue here which is that in a scale environment we
> will need to change the basic way we work.
> If we have >500 hosts then we would hit this issue regardless of the changes
> above.
>
> What needs to be done is change the way vdsBroker works to be totally async
> and then we can reduce number of threads required to 1 or 2 (even for
> thousands of hosts, assuming the engine can handle the processing of so many
> requests and responses).

This applies to a lot more than just connectStorageServer and requires a different infra to work async against VDSM (in the least).

Please check http://gerrit.ovirt.org/#/c/19744/. I think this bug is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1007393

(In reply to Liran Zelkha from comment #40)
> Please check http://gerrit.ovirt.org/#/c/19744/.
> I think this bug is a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1007393

Seems as though it is indeed the same root cause. Moving to MODIFIED so QA can independently verify when the 3.4.0 verifications come up.
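As referenced above, here is a sketch of the "totally async" direction. The broker interface and all names are invented to show the shape of the change, not the engine's actual API; the assumption is a transport (for example an NIO-based RPC client) that completes the futures from its own one or two I/O threads.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical async broker interface - not the real VDSBrokerFrontend API.
interface AsyncVdsBroker {
    // The transport completes the future from its own I/O thread;
    // no engine worker blocks while the host handles the request.
    CompletableFuture<Boolean> connectStorageServer(String hostId, String connectionId);
}

public class AsyncConnectSketch {

    // Fan out one connect per host and aggregate the results without tying up
    // a thread per call: 8 domains x 50 hosts means 400 requests in flight,
    // but only the transport's couple of I/O threads doing the work.
    public static CompletableFuture<Boolean> connectAll(
            AsyncVdsBroker broker, List<String> hostIds, String connectionId) {

        List<CompletableFuture<Boolean>> calls = hostIds.stream()
                .map(host -> broker.connectStorageServer(host, connectionId))
                .collect(Collectors.toList());

        return CompletableFuture
                .allOf(calls.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> calls.stream()
                        .allMatch(CompletableFuture::join)); // all done at this point
    }
}
```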
I unsuccessfully tried to verify this bug on:
RHEV-M 3.5.0-0.22.el6ev
RHEL - 6Server - 6.6.0.2.el6
libvirt-0.10.2-46.el6_6.1
vdsm-4.16.7.6-1.el6ev

Created 2 Data Centers - Real and Fake. The Real DC has a physical host which runs 50 VMs (4 CPUs, 2G RAM) and NetApp NFS storage. The Fake DC has 50 fake hosts and 10 iSCSI Storage Domains. Several minutes after activating the last iSCSI Storage Domain, the Data Center went to status "Contending" and all iSCSI Storage Domains went to Unknown status.

Errors found in engine.log:

```
2014-12-22 03:10:43,678 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-60) [38c55ac7] Failed in HSMGetTaskStatusVDS method
2014-12-22 03:10:43,678 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-60) [38c55ac7] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = [Errno 19] Could not find dm device named `unknown device`, code = 100
2014-12-22 03:10:43,777 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [701e7d60] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
```

vdsm.log:

```
Thread-27528::ERROR::2014-12-22 08:01:08,343::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 1b6d6d26-9e81-4b05-8913-20d10a0bb779 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 221, in _monitorDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1360, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 421, in __init__
    lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
  File "/usr/share/vdsm/storage/lvm.py", line 1020, in checkVGBlockSizes
    _checkpvsblksize(pvs, vgBlkSize)
  File "/usr/share/vdsm/storage/lvm.py", line 997, in _checkpvsblksize
    pvBlkSize = _getpvblksize(pv)
  File "/usr/share/vdsm/storage/lvm.py", line 991, in _getpvblksize
    dev = devicemapper.getDmId(os.path.basename(pv))
  File "/usr/share/vdsm/storage/devicemapper.py", line 39, in getDmId
    deviceMultipathName)
OSError: [Errno 19] Could not find dm device named `unknown device`
```

Uploaded new engine and vdsm log files.

Created attachment 971901 [details]
engine rhevm 3.5 log
Created attachment 971902 [details]
vdsm rhevm 3.5 log 1
Created attachment 971903 [details]
vdsm rhevm 3.5 log 2
(In reply to Yuri Obshansky from comment #47)
> I unsuccessfully tried verify bug on
> RHEV-M 3.5.0-0.22.el6ev
> RHEL - 6Server - 6.6.0.2.el6
> libvirt-0.10.2-46.el6_6.1
> vdsm-4.16.7.6-1.el6ev
> Created:
> 2 Data Centers - Real and Fake
> Real DC has physical host which run 50 VMs (4CPUs, 2G RAM) and NetApp NFS
> Storage
> Fake DC has 50 fake hosts and 10 ISCSI Storage Domains
> After several minutes after activation last ISCSI Storage Domain
> Data Center went to status "Contending" and all ISCSI Storage Domains to
> Unknown status.
> Errors found in engine.log:
> 2014-12-22 03:10:43,678 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
> (DefaultQuartzScheduler_Worker-60) [38c55ac7] Failed in HSMGetTaskStatusVDS
> method
> 2014-12-22 03:10:43,678 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
> (DefaultQuartzScheduler_Worker-60) [38c55ac7] Start SPM Task failed -
> result: cleanSuccess, message: VDSGenericException: VDSErrorException:
> Failed to HSMGetTaskStatusVDS, error = [Errno 19] Could not find dm device
> named `unknown device`, code = 100
> 2014-12-22 03:10:43,777 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (DefaultQuartzScheduler_Worker-60) [701e7d60]
> IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException:
> IRSGenericException: IRSErrorException: SpmStart failed
>
> vdsm.log
> Thread-27528::ERROR::2014-12-22
> 08:01:08,343::domainMonitor::256::Storage.DomainMonitorThread::
> (_monitorDomain) Error while collecting domain
> 1b6d6d26-9e81-4b05-8913-20d10a0bb779 monitoring information
> Traceback (most recent call last):
> File "/usr/share/vdsm/storage/domainMonitor.py", line 221, in
> _monitorDomain
> self.domain = sdCache.produce(self.sdUUID)
> File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
> domain.getRealDomain()
> File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
> return self._cache._realProduce(self._sdUUID)
> File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
> domain = self._findDomain(sdUUID)
> File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
> dom = findMethod(sdUUID)
> File "/usr/share/vdsm/storage/blockSD.py", line 1360, in findDomain
> return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
> File "/usr/share/vdsm/storage/blockSD.py", line 421, in __init__
> lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
> File "/usr/share/vdsm/storage/lvm.py", line 1020, in checkVGBlockSizes
> _checkpvsblksize(pvs, vgBlkSize)
> File "/usr/share/vdsm/storage/lvm.py", line 997, in _checkpvsblksize
> pvBlkSize = _getpvblksize(pv)
> File "/usr/share/vdsm/storage/lvm.py", line 991, in _getpvblksize
> dev = devicemapper.getDmId(os.path.basename(pv))
> File "/usr/share/vdsm/storage/devicemapper.py", line 39, in getDmId
> deviceMultipathName)
> OSError: [Errno 19] Could not find dm device named `unknown device`
>
> Uploaded new engine and vdsm log files

This issue has nothing to do with the thread pool - can you please open a separate bug on it (and presumably block this one on it?), so we can track it effectively?

Yes, done. Open blocker for this bug - Bug 1176517 - [scale] Data Center crashing and contending forever. All SDs are Unknown/Inactive.

(In reply to Yuri Obshansky from comment #52)
> Yes, done.
> Open blocker for this bug - Bug 1176517 - [scale] Data Center crashing and
> contending forever. All SDs are Unknown/Inactive.

Thanks! Let's take it up there.
Allon, can we push the verification of this BZ to 3.5.1? It is currently blocked on a fix for BZ 1176517, which is only expected in 3.5.1.

(In reply to Gil Klein from comment #54)
> Allon, Can we push the verification of this BZ to 3.5.1?
>
> It is currently blocked on a fix for BZ 1176517, which is only expected in
> 3.5.1.

Sure, I'm fine with that. I'll push it out and create a z-stream clone.

(In reply to Allon Mureinik from comment #55)
> (In reply to Gil Klein from comment #54)
> > Allon, Can we push the verification of this BZ to 3.5.1?
> >
> > It is currently blocked on a fix for BZ 1176517, which is only expected in
> > 3.5.1.
> Sure, I'm fine with that.
> I'll push it out and create a z-stream clone.

On second thought, wait a sec - this BZ was verified in 3.4.1 (see the z-stream clone, 1106572). Do you think it's worth the effort to re-verify it for 3.5.1 and 3.6.0? Since this behavior did not change in 3.5.0 (it's just a different bug that blocks you from even getting to this scenario), wouldn't it be preferable to just close this one with CURRENT_RELEASE?

Since this bug depends on a bug that will only be included in 3.5.1, I'm moving this one to 3.5.1 as well, so it can be verified.

(In reply to Yaniv Dary from comment #57)
> Since this bug depends on a bug that will only be included on 3.5.1. I'm
> moving this one to 3.5.1 as well, so it can be verified.

Minor correction - this bug doesn't really depend on bug 1176517, but it does block QA verification on it.

(In reply to Allon Mureinik from comment #58)
> (In reply to Yaniv Dary from comment #57)
> > Since this bug depends on a bug that will only be included on 3.5.1. I'm
> > moving this one to 3.5.1 as well, so it can be verified.
> Minor correction - this bug doesn't really depend on bug 1176517, but it
> does block QA verification on it.

Also, resetting to MODIFIED - as there isn't a 3.5.1 build yet, this BZ cannot be ON_QA.

RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE.
Created attachment 754401 [details]
Logs rhevm, vdsm, libvirt, thread dump

Description of problem:
ConnectStorageServer failed - The thread pool is out of limit

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17.1 environment:
RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 50 hosts (in my case 50 fake hosts).
2. Add 9 Storage Domains, where each SD consists of 8 VGs.
Before running the next step I have the following state:
SD-01 – Active (Master)
SD-02 – Maintenance
SD-03 – Maintenance
SD-04 – Maintenance
SD-05 – Maintenance
SD-06 – Maintenance
SD-07 – Maintenance
SD-08 – Maintenance
SD-09 – Maintenance
3. Now try to activate the 8 Storage Domains (SD-02, SD-03, ..., SD-09). A scripted sketch of this activation burst appears at the end of this report.

Actual results:
The thread pool is out of limit (maximum 500 threads).

Expected results:
No problems with the number of threads.

Impact on user:

Workaround:

Additional info:

/var/log/ovirt-engine/engine.log

```
2013-05-29 12:17:43,090 WARN  [org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil] (pool-4-thread-951) [5f248318] The thread pool is out of limit. The submitted event was rejected
2013-05-29 12:17:43,090 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (pool-4-thread-951) [5f248318] Failed in ConnectStorageServerVDS method, for vds: Fake_Host_050; host: 10.35.162.22
2013-05-29 12:17:43,090 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-951) [5f248318] Command ConnectStorageServerVDS execution failed. Exception: RejectedExecutionException: Task java.util.concurrent.FutureTask@36e4616e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@1dfc445b[Running, pool size = 500, active threads = 500, queued tasks = 0, completed tasks = 326411]
2013-05-29 12:17:43,090 WARN  [org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil] (pool-4-thread-959) [5f248318] The thread pool is out of limit. The submitted event was rejected
```

/var/log/vdsm/vdsm.log
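For reference, the activation burst in step 3 of the reproduction can be scripted instead of clicked through the UI. The sketch below assumes the RHEV 3.2-era REST action POST /api/datacenters/{dc}/storagedomains/{sd}/activate; the engine URL, credentials, and UUIDs are placeholders, and TLS trust setup is omitted, so treat it as an illustration of the burst rather than a supported reproduction script.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ActivateDomainsBurst {

    public static void main(String[] args) throws Exception {
        String engine = "https://rhevm.example.com";     // placeholder engine URL
        String dcId = "DATACENTER-UUID";                  // placeholder Data Center id
        String[] sdIds = { "SD-02-UUID", "SD-03-UUID", /* ... */ "SD-09-UUID" };
        String auth = Base64.getEncoder().encodeToString(
                "admin@internal:password".getBytes(StandardCharsets.UTF_8));

        // Fire the activations back to back, as in the original report; each one
        // fans out a ConnectStorageServerVDS call to every host in the DC.
        for (String sd : sdIds) {
            URL url = new URL(engine + "/api/datacenters/" + dcId
                    + "/storagedomains/" + sd + "/activate");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("Content-Type", "application/xml");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write("<action/>".getBytes(StandardCharsets.UTF_8));
            }
            System.out.println(sd + " -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}
```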