Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1438497

Summary:	[scale] - tasks rejection by thread pool util
Product:	[oVirt] ovirt-engine	Reporter:	Eldad Marciano <emarcian>
Component:	Backend.Core	Assignee:	Martin Perina <mperina>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Eldad Marciano <emarcian>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.1.1.2	CC:	bugs, lveyde, mperina, pkliczew
Target Milestone:	ovirt-4.1.2	Keywords:	Performance
Target Release:	4.1.2	Flags:	rule-engine: ovirt-4.1+ rule-engine: blocker+
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-05-23 08:18:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Infra	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Eldad Marciano 2017-04-03 14:20:45 UTC

Description of problem:

The engine rejects tasks submitted to ThreadPoolUtil at high scale, especially for 'ConnectDomainToStorageCommand':

2017-04-02 17:55:01,465+03 WARN  [org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil] (DefaultQuartzScheduler24) [296ae34] The thread pool failed to execute list of tasks: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

2017-04-02 17:55:01,465+03 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler24) [296ae34] Command 'org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand' failed: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

2017-04-02 17:55:01,465+03 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (DefaultQuartzScheduler24) [296ae34] Exception: java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@7dc52f6e rejected from org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalThreadExecutor@3275fd7b[Running, pool size = 500, active threads = 87, queued tasks = 100, completed tasks = 88300]

at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) [rt.jar:1.8.0_121]

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) [rt.jar:1.8.0_121]


It seems we can patch this the same way we did for the Quartz pool in
https://bugzilla.redhat.com/show_bug.cgi?id=1429534

Version-Release number of selected component (if applicable):
4.1.1.2

How reproducible:
Not clear

Steps to Reproduce:
1. Set up 254 hosts (250 nested), 3 SDs (iSCSI), and ~1400 real VMs with 3 disks each (thin provisioned).
2. Create new VMs on top of this setup.

Actual results:
Tasks are rejected; VM creation runs for a long time, apparently because its tasks are rejected.
'ConnectDomainToStorageCommand' tasks are also rejected, which may cause storage issues and prevent VMs from being created.

Expected results:
A RejectPolicy like the one introduced in https://gerrit.ovirt.org/#/c/74034/ may resolve the problem,
but postponed tasks will apparently be handled only after a long time, since they end up far back in the queue.

Additional info:

Comment 2 Piotr Kliczewski 2017-04-04 06:52:23 UTC

Interesting find, Eldad. It looks like ThreadPoolUtil rejects tasks the same way the Quartz pool did before, due to its bounded queue size [1]. We may want to fix it in the same way.


[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/utils/src/main/java/org/ovirt/engine/core/utils/threadpool/ThreadPoolUtil.java#L41
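
The rejection seen in the log can be reproduced in isolation with a plain java.util.concurrent.ThreadPoolExecutor. The following is only a minimal sketch, not the ThreadPoolUtil code: the class name BoundedQueueRejectionDemo, the method countRejections and the tiny pool and queue sizes are assumptions chosen for illustration. It relies on the default handler, AbortPolicy, which is the one visible in the stack trace in the description.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it is not part of ovirt-engine.
public class BoundedQueueRejectionDemo {

    /** Submits `tasks` blocking tasks and returns how many were rejected. */
    public static int countRejections(int threads, int queueCapacity, int tasks) {
        // No handler is passed, so the executor uses the default AbortPolicy.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // keep the worker busy until released
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // all threads busy and the bounded queue is full
                }
            }
        } finally {
            release.countDown(); // let the accepted tasks finish
            pool.shutdown();
            try {
                pool.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected = " + countRejections(2, 3, 10));
    }
}
```

With 2 threads and a queue capacity of 3, only 5 of the 10 blocking tasks can be accepted, so the remaining 5 are rejected.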

Comment 3 Martin Perina 2017-04-05 06:31:08 UTC

We will add the same policy we used in the Quartz thread pool: block the thread that is trying to add a new task to an exhausted queue. This will slow down the engine (tasks are not rejected but wait), but we really need to take a look at the related storage flow in 4.2 and optimize it so that it does not require that many threads.
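
For illustration, a blocking policy of this kind can be sketched with a custom RejectedExecutionHandler that puts the task into the executor's queue and waits for space. This is not the actual ovirt-engine patch: the names BlockingSubmitDemo, BlockWhenFullPolicy and runAll are hypothetical, and the sketch assumes a fixed-size pool so that workers are always available to drain the queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it is not part of ovirt-engine.
public class BlockingSubmitDemo {

    /** Blocks the submitting thread until the queue has room, instead of rejecting. */
    static final class BlockWhenFullPolicy implements RejectedExecutionHandler {
        @Override
        public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
            if (executor.isShutdown()) {
                throw new RejectedExecutionException("executor is shut down");
            }
            try {
                executor.getQueue().put(task); // waits while the queue is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException("interrupted while waiting for queue space", e);
            }
        }
    }

    /** Submits `tasks` short tasks and returns how many of them completed. */
    public static int runAll(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity), new BlockWhenFullPolicy());
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(5); // simulate a slow command
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
        }
        pool.shutdown(); // already queued tasks still run after shutdown()
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed = " + runAll(2, 3, 50));
    }
}
```

The trade-off is the one described above: the submitting thread is throttled to the speed of the workers, so nothing is lost but callers wait. One known caveat of this pattern is that a task put into the queue while the executor is being shut down concurrently may never run.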

Comment 4 Eldad Marciano 2017-04-05 08:50:31 UTC

(In reply to Martin Perina from comment #3)
> We will add the same policy to block the thread which is trying to add a new
> task into exhausted queue as we used in Quartz thread pool. This will slow
> down engine (task are not rejected but waiting), but we really need to take
> a look at related storage flow in 4.2 and optimized it not to require that
> many threads.

Do you mind if I open a new bug about slow storage commands for 4.2 in order to track it accordingly?

Comment 5 Martin Perina 2017-04-07 07:42:31 UTC

(In reply to Eldad Marciano from comment #4)
> (In reply to Martin Perina from comment #3)
> > We will add the same policy to block the thread which is trying to add a new
> > task into exhausted queue as we used in Quartz thread pool. This will slow
> > down engine (task are not rejected but waiting), but we really need to take
> > a look at related storage flow in 4.2 and optimized it not to require that
> > many threads.
> 
> do you mind if I'll open a new bug about slow storage commands for 4.2 in
> order to tack it accordingly.

Feel free to do so ...

Comment 6 Eldad Marciano 2017-05-12 09:39:50 UTC

Verification status:
The bug was verified with vdsmfake: 250 hosts, 1 SD, 2350 VMs, 4340 disks overall.
On top of this topology we created VMs in bulks of 10, 50, and 90.
No tasks were rejected.

We also simulated high latency with vdsmfake in order to stress the utils queue further.

We'll keep stressing the queue and will move this bug to verified ASAP.

Currently there is no evidence of task rejection.

==== engine error rate ====
JdbcConnectionException - 0
Tasks rejected - 0
VM not responding - 0
Thread interrupts - 0

Comment 7 Eldad Marciano 2017-05-14 11:01:55 UTC

This bug has been verified on top of RHV 4.1.2.1-0.1.el7.
Topology:
250 hosts, 1 SD, 2846 VMs, 4847 disks.

We stressed the engine with more than 100 tasks per second; no tasks were rejected.

Moving this bug to verified.