Bug 1438497 - [scale] - tasks rejection by thread pool util
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.1.1.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.2
Target Release: 4.1.2
Assignee: Martin Perina
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-04-03 14:20 UTC by Eldad Marciano
Modified: 2017-05-23 08:18 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-23 08:18:31 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 75138 0 ovirt-engine-4.1 MERGED utils: do not reject tasks 2017-04-07 07:46:26 UTC
oVirt gerrit 75229 0 master MERGED utils: do not reject tasks 2017-04-07 07:38:23 UTC

Description Eldad Marciano 2017-04-03 14:20:45 UTC
Description of problem:

The engine rejects tasks submitted to the ThreadPoolUtil thread pool at high scale, especially for 'ConnectDomainToStorageCommand'.

2017-04-02 17:55:01,465+03 WARN  [org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil] (DefaultQuartzScheduler24) [296ae34] The thread pool failed to execute list of tasks: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

2017-04-02 17:55:01,465+03 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler24) [296ae34] Command 'org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand' failed: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

2017-04-02 17:55:01,465+03 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler24) [296ae34] Exception: java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) [rt.jar:1.8.0_121]

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) [rt.jar:1.8.0_121]


It seems we can patch this the same way we did for the Quartz pool in
https://bugzilla.redhat.com/show_bug.cgi?id=1429534

Version-Release number of selected component (if applicable):
4.1.1.2

How reproducible:
Not clear

Steps to Reproduce:
1. Set up 254 hosts (250 nested), 3 SDs (iSCSI), ~1400 real VMs with 3 disks each (thin provisioned).
2. Create new VMs.

Actual results:
Tasks are rejected. VM creation runs for a long time, apparently because its tasks are rejected.
'ConnectDomainToStorageCommand' tasks are also rejected, which may cause storage issues and prevent VMs from being created.

Expected results:
The RejectPolicy introduced in https://gerrit.ovirt.org/#/c/74034/ may resolve the problem,
but postponed tasks will apparently be handled only after a long delay, since they end up far back in the queue.

Additional info:

Comment 2 Piotr Kliczewski 2017-04-04 06:52:23 UTC
Interesting find, Eldad. It looks like ThreadPoolUtil rejects tasks because of its bounded queue size [1], just as the Quartz pool did before. We may want to fix it in the same way.


[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/utils/src/main/java/org/ovirt/engine/core/utils/threadpool/ThreadPoolUtil.java#L41
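
For illustration, the rejection described above can be reproduced with a plain java.util.concurrent.ThreadPoolExecutor. This is a minimal sketch, not the ThreadPoolUtil code: the class name BoundedQueueRejectionDemo, the method countRejected, and the small pool and queue sizes are assumptions chosen for the example. It uses the default AbortPolicy, which is the handler that appears in the stack trace in the description.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of ovirt-engine.
class BoundedQueueRejectionDemo {

    // Submits `submissions` tasks that all block until released, and
    // returns how many of them the pool rejected.
    static int countRejected(int threads, int queueSize, int submissions)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),      // bounded queue
                new ThreadPoolExecutor.AbortPolicy());    // JDK default handler
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < submissions; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // keep the worker busy
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // All threads are busy and the queue is full.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the accepted tasks finish
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        // 2 running + 3 queued are accepted, the other 5 are rejected.
        System.out.println("rejected = " + countRejected(2, 3, 10));
    }
}
```

With 2 threads and a queue of 3, only 5 of the 10 blocking tasks can be accepted, so the sketch reports 5 rejections.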

Comment 3 Martin Perina 2017-04-05 06:31:08 UTC
We will add the same policy we used in the Quartz thread pool, which blocks the thread that is trying to add a new task to an exhausted queue. This will slow down the engine (tasks are not rejected but wait), but we really need to take a look at the related storage flow in 4.2 and optimize it so it does not require that many threads.
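
One way such a blocking policy can be expressed with the standard JDK API is a custom RejectedExecutionHandler that puts the task on the queue and waits for space. The sketch below is an assumption about the general technique, not the content of the merged gerrit patches: BlockingSubmitDemo, BlockWhenFullPolicy, and runAll are hypothetical names, and the pool sizes are arbitrary.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of ovirt-engine.
class BlockingSubmitDemo {

    // Blocks the submitting thread until the queue has room again,
    // instead of throwing RejectedExecutionException.
    static class BlockWhenFullPolicy implements RejectedExecutionHandler {
        @Override
        public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
            if (executor.isShutdown()) {
                throw new RejectedExecutionException("executor has been shut down");
            }
            try {
                executor.getQueue().put(task); // waits for free queue space
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException(
                        "interrupted while waiting for queue space", e);
            }
        }
    }

    // Submits `submissions` short tasks and returns how many completed.
    static int runAll(int threads, int queueSize, int submissions)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new BlockWhenFullPolicy());
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < submissions; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(5); // simulate a little work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
        }
        pool.shutdown(); // queued tasks still run after shutdown()
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed = " + runAll(2, 3, 20));
    }
}
```

The trade-off matches the comment above: no task is dropped, but the submitting thread stalls whenever the queue is full, so overall throughput is limited by how fast the workers drain the queue.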

Comment 4 Eldad Marciano 2017-04-05 08:50:31 UTC
(In reply to Martin Perina from comment #3)
> We will add the same policy to block the thread which is trying to add a new
> task into exhausted queue as we used in Quartz thread pool. This will slow
> down engine (tasks are not rejected but waiting), but we really need to take
> a look at related storage flow in 4.2 and optimize it not to require that
> many threads.

Do you mind if I open a new bug about slow storage commands for 4.2 in order to track it accordingly?

Comment 5 Martin Perina 2017-04-07 07:42:31 UTC
(In reply to Eldad Marciano from comment #4)
> (In reply to Martin Perina from comment #3)
> > We will add the same policy to block the thread which is trying to add a new
> > task into exhausted queue as we used in Quartz thread pool. This will slow
> > down engine (tasks are not rejected but waiting), but we really need to take
> > a look at related storage flow in 4.2 and optimize it not to require that
> > many threads.
> 
> do you mind if I'll open a new bug about slow storage commands for 4.2 in
> order to track it accordingly.

Feel free to do so ...

Comment 6 Eldad Marciano 2017-05-12 09:39:50 UTC
Verification status:
The bug was tested with vdsmfake: 250 hosts, 1 SD, 2350 VMs, 4340 disks overall.
On top of this topology we created VMs in bulks of 10, 50, and 90.
No tasks were rejected.

We also simulated high latency with vdsmfake in order to stress the utils queue further.

We'll keep stressing the queue and will move this bug to verified ASAP.

Currently there is no evidence of task rejection.

==== engine error rate ====
JdbcConnectionException - 0
Tasks rejected - 0
VM not responding - 0
Thread interrupts - 0

Comment 7 Eldad Marciano 2017-05-14 11:01:55 UTC
This bug was verified on top of RHV 4.1.2.1-0.1.el7.
Topology:
250 hosts, 1 SD, 2846 VMs, 4847 disks.

We stressed the engine with more than 100 tasks per second; no tasks were rejected.

Moving this bug to verified.

