Description of problem:
The engine rejects tasks submitted to the ThreadPoolUtil thread pool at high scale, especially for 'ConnectDomainToStorageCommand'.

2017-04-02 17:55:01,465+03 WARN [org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil] (DefaultQuartzScheduler24) [296ae34] The thread pool failed to execute list of tasks: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]
2017-04-02 17:55:01,465+03 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler24) [296ae34] Command 'org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand' failed: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]
2017-04-02 17:55:01,465+03 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler24) [296ae34] Exception: java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]
Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) [rt.jar:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) [rt.jar:1.8.0_121]

It seems we can patch this as we did for the Quartz pool in https://bugzilla.redhat.com/show_bug.cgi?id=1429534

Version-Release number of selected component (if applicable):
4.1.1.2

How reproducible:
Not clear

Steps to Reproduce:
1. Environment: 254 hosts (250 nested), 3 SDs (iSCSI), ~1400 real VMs with 3 disks each (thin provisioned).
2. Create new VMs.

Actual results:
Tasks are rejected. VM creation runs for a long time, apparently because its tasks are rejected. 'ConnectDomainToStorageCommand' tasks are also rejected, which may cause storage issues and prevent VMs from being created.

Expected results:
The RejectPolicy introduced in https://gerrit.ovirt.org/#/c/74034/ may resolve the problem, but postponed tasks will apparently be handled only after a long delay, since they sit far back in the queue.

Additional info:
Interesting find, Eldad. It looks like ThreadPoolUtil rejects tasks the same way the Quartz pool did before, because of its bounded queue size [1]. We may want to fix it in the same way.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/utils/src/main/java/org/ovirt/engine/core/utils/threadpool/ThreadPoolUtil.java#L41
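To illustrate the rejection described above, here is a minimal sketch of how a ThreadPoolExecutor with a bounded queue and the default AbortPolicy (the policy visible in the stack trace) throws RejectedExecutionException once all threads are busy and the queue is full. The class name, method name, and the small pool and queue sizes are made up for the example; this is not the ThreadPoolUtil code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of ovirt-engine.
public class BoundedQueueRejectionDemo {

    /** Submits tasks that stay blocked and returns how many submissions were rejected. */
    public static int countRejections(int poolSize, int queueCapacity, int submitted) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded queue
                new ThreadPoolExecutor.AbortPolicy());     // throws when saturated
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < submitted; i++) {
                try {
                    // Every task blocks until the latch opens, so no capacity frees up.
                    executor.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;
                }
            }
        } finally {
            release.countDown();   // let the accepted tasks finish
            executor.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected = " + countRejections(2, 3, 10));
    }
}
```

With 2 threads and 3 queue slots, only 5 of the 10 submissions can be accepted, so the remaining 5 are rejected.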
We will add the same policy that we used in the Quartz thread pool: block the thread that tries to add a new task to an exhausted queue. This will slow down the engine (tasks are not rejected but wait), but we really need to look at the related storage flow in 4.2 and optimize it so that it does not require that many threads.
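A blocking policy of this kind could be sketched roughly as follows. This is only an illustration of the idea, not the actual engine patch: the class names BlockingRejectionDemo and BlockWhenFullPolicy, the method runAll, and the pool sizes are all assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of ovirt-engine.
public class BlockingRejectionDemo {

    /** Hypothetical handler: makes the submitting thread wait for queue space. */
    public static class BlockWhenFullPolicy implements RejectedExecutionHandler {
        @Override
        public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
            if (executor.isShutdown()) {
                throw new RejectedExecutionException("Executor is shut down");
            }
            try {
                // Blocks the caller until a worker takes a task from the queue.
                executor.getQueue().put(task);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException("Interrupted while waiting for queue space", e);
            }
        }
    }

    /** Submits short tasks to a small pool and returns how many completed. */
    public static int runAll(int poolSize, int queueCapacity, int submitted) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new BlockWhenFullPolicy());
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < submitted; i++) {
            executor.execute(() -> {
                try {
                    Thread.sleep(5);   // simulate a slow storage call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
        }
        executor.shutdown();
        try {
            executor.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed = " + runAll(2, 3, 50));
    }
}
```

The trade-off is the one noted above: nothing is rejected, but the submitting thread stalls while the queue is full. Putting tasks directly into the queue also bypasses the executor's own state checks, so a real implementation would need to handle shutdown races more carefully than this sketch does.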
(In reply to Martin Perina from comment #3)
> We will add the same policy to block the thread which is trying to add a new
> task into exhausted queue as we used in Quartz thread pool. This will slow
> down engine (task are not rejected but waiting), but we really need to take
> a look at related storage flow in 4.2 and optimized it not to require that
> many threads.

Do you mind if I open a new bug about slow storage commands for 4.2 in order to track it accordingly?
(In reply to Eldad Marciano from comment #4)
> (In reply to Martin Perina from comment #3)
> > We will add the same policy to block the thread which is trying to add a new
> > task into exhausted queue as we used in Quartz thread pool. This will slow
> > down engine (task are not rejected but waiting), but we really need to take
> > a look at related storage flow in 4.2 and optimized it not to require that
> > many threads.
>
> do you mind if I'll open a new bug about slow storage commands for 4.2 in
> order to tack it accordingly.

Feel free to do so ...
Verification status:
The bug was verified with vdsmfake: 250 hosts, 1 SD, 2350 VMs, 4340 disks overall.
On top of this topology we created VMs in bulks of 10, 50, and 90; no tasks were rejected.
We also simulated high latency with vdsmfake in order to put more stress on the utils queue.
We will keep stressing the queue and will move this bug to verified ASAP. Currently there is no evidence of task rejection.

==== engine error rate ====
JdbcConnectionException - 0
Tasks rejected - 0
VM not responding - 0
Thread interrupts - 0
This bug has been verified on top of RHV 4.1.2.1-0.1.el7.
Topology: 250 hosts, 1 SD, 2846 VMs, 4847 disks.
We stressed the engine with more than 100 tasks per second; no tasks were rejected.
Moving this bug to verified.