Bug 1429534 - [scale] - tasks rejection, causing to corrupted monitoring.
Summary: [scale] - tasks rejection, causing to corrupted monitoring.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.1.1.2
Hardware: x86_64
OS: Linux
high
urgent with 1 vote
Target Milestone: ovirt-4.1.1-1
: 4.1.1.7
Assignee: Martin Perina
QA Contact: Eldad Marciano
URL:
Whiteboard:
: 1417940 1506373 (view as bug list)
Depends On:
Blocks: 1364791 1417940
TreeView+ depends on / blocked
 
Reported: 2017-03-06 14:36 UTC by Eldad Marciano
Modified: 2021-05-01 16:47 UTC (History)
13 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:33:03 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 74034 0 'None' MERGED scheduler: make task queue size configurable 2020-12-23 22:04:00 UTC
oVirt gerrit 74326 0 'None' MERGED scheduler: make task queue size configurable 2020-12-23 22:04:00 UTC
oVirt gerrit 74328 0 'None' MERGED scheduler: make task queue size configurable 2020-12-23 22:04:00 UTC

Description Eldad Marciano 2017-03-06 14:36:52 UTC
Description of problem:

In part of scale out nested vms, some tasks were rejected and nothing will happen since than.
specially for create vms or disks.

we used the api to trigger vm creation and we found plenty of vms in locked state forever, since the engine missed their monitoring task results.

for instance let have a look on the engine output, we have plenty of:

2017-03-06 00:13:39,777+02 ERROR [org.ovirt.engine.core.utils.timer.SchedulerThreadPool] (DefaultQuartzScheduler_QuartzSchedulerThread) [] Task rejected due to: : java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@cd2fd03 rejected from java.util.concurrent.ThreadPoolExecutor@37a79d5e[Running, pool size = 250, active threads = 250, queued tasks = 50, completed tasks = 250]

and tasks which wait for complete status.
2017-03-06 11:46:24,734+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler222) [7caad81a] Command 'AddVm' (id: '31c2ae20-115a-4c34-b811-defa3bf118e6') waiting on child command id: '2a3dc351-602e-434b-87a1-5594ad41bbec' type:'CreateSnapshotFromTemplate' to complete


In time vdsm (SPM) reports that all tasks completed:

017-03-06 15:36:28,637+0200 INFO  (jsonrpc/0) [dispatcher] Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'e0c32d8d-5691-4d9a-aedf-c063f9343f3c': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e0c32d8d-5691-4d9a-aedf-c063f9343f3c'}, '10ecbc6b-c13d-4533-aeda-255b87104026': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '10ecbc6b-c13d-4533-aeda-255b87104026'}, '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df'}, 'c0793e9f-6090-4251-a062-50baa0bf2cc1': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c0793e9f-6090-4251-a062-50baa0bf2cc1'}, '446ae593-6e99-41bc-a5c0-f2d453cfed63': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '446ae593-6e99-41bc-a5c0-f2d453cfed63'}, 'b8343b61-680e-4718-b1f0-4b96a2198cbd': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'b8343b61-680e-4718-b1f0-4b96a2198cbd'}, '3a5355b9-0d5d-4bc1-92d1-32902f704e98': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '3a5355b9-0d5d-4bc1-92d1-32902f704e98'}, 'f7788864-9887-4540-83cd-079ed5c76be6': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f7788864-9887-4540-83cd-079ed5c76be6'}, 'cabbd9c9-2788-4b25-9332-e099619df80a': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'cabbd9c9-2788-4b25-9332-e099619df80a'}, '83bf0607-2567-4a54-8f46-711ca197ec51': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '83bf0607-2567-4a54-8f46-711ca197ec51'}, 'd4775ebe-be2b-4259-8dbd-1c115d885887': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'd4775ebe-be2b-4259-8dbd-1c115d885887'}, '8e2056eb-a138-47ae-9bde-de02e283d8d8': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '8e2056eb-a138-47ae-9bde-de02e283d8d8'}, '58dc37e2-3504-4816-929e-84d70a4e2253': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '58dc37e2-3504-4816-929e-84d70a4e2253'}, '9f10c0de-4964-4d89-977e-5e70e5d1f298': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '9f10c0de-4964-4d89-977e-5e70e5d1f298'}, '1f969cce-3100-4627-8dac-f1fce93c03b3': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '1f969cce-3100-4627-8dac-f1fce93c03b3'}, 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7'}, '5970e714-ee76-48b0-8509-b67e592baac4': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '5970e714-ee76-48b0-8509-b67e592baac4'}, '4cc75096-e02e-4b8d-98e9-0749bf484450': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '4cc75096-e02e-4b8d-98e9-0749bf484450'}, '12abf7d1-6df5-4d23-8c8a-b7c550bef406': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '12abf7d1-6df5-4d23-8c8a-b7c550bef406'}, '9510ff20-283a-4c4e-a67b-91faade156e2': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '9510ff20-283a-4c4e-a67b-91faade156e2'}, '51fcb0be-6f95-4f48-8007-69411fe45abb': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '51fcb0be-6f95-4f48-8007-69411fe45abb'}, '015acaca-4928-4060-abc0-6c7d49ee1412': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '015acaca-4928-4060-abc0-6c7d49ee1412'}, '1aca91b7-7c60-4728-943b-840435231992': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '1aca91b7-7c60-4728-943b-840435231992'}, '04cbd940-69ed-4f18-a646-8c742182a43d': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '04cbd940-69ed-4f18-a646-8c742182a43d'}, 'dc9ffe80-4214-4c97-ae0c-feaf781f4190': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'dc9ffe80-4214-4c97-ae0c-feaf781f4190'}, '0ec928d6-fed1-413e-ae64-102b5551ffd6': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '0ec928d6-fed1-413e-ae64-102b5551ffd6'}, '5d16de2f-3e60-43f7-aa38-d38f683dd247': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '5d16de2f-3e60-43f7-aa38-d38f683dd247'}, 'dc777dca-5a89-41b3-9132-d06298abdf66': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'dc777dca-5a89-41b3-9132-d06298abdf66'}, '347333ce-f463-49e2-b8b4-58291a1d1e71': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '347333ce-f463-49e2-b8b4-58291a1d1e71'}, '89b31229-d880-4f7f-b0d6-79c0f9b584c2': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '89b31229-d880-4f7f-b0d6-79c0f9b584c2'}, '4ae345db-09cb-4455-a94a-e5f54dba7dc3': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '4ae345db-09cb-4455-a94a-e5f54dba7dc3'}, 'e47c0497-65f6-472c-827b-158c59c74aee': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e47c0497-65f6-472c-827b-158c59c74aee'}, '71d4aafa-ac8b-430e-b8f6-485e4ffde67d': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '71d4aafa-ac8b-430e-b8f6-485e4ffde67d'}, '00744921-cf79-44db-84a8-c195e3cb3d88': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '00744921-cf79-44db-84a8-c195e3cb3d88'}, 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12'}, 'adb119f2-2731-441a-8c5f-5d6feffaa628': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'adb119f2-2731-441a-8c5f-5d6feffaa628'}, 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388'}, 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb'}, '1a557b8c-2c0e-4979-9220-10064724dad9': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '1a557b8c-2c0e-4979-9220-10064724dad9'}, 'e337a0c9-5c34-4b92-894c-58ab90656464': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e337a0c9-5c34-4b92-894c-58ab90656464'}, 'eef8223f-7838-46c3-b27d-56ee621a65a9': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'eef8223f-7838-46c3-b27d-56ee621a65a9'}, '88594424-25cc-4224-9d8e-1ad715d25d4c': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '88594424-25cc-4224-9d8e-1ad715d25d4c'}, 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e'}, '2d142116-a062-45af-9dbf-bd57ac276660': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '2d142116-a062-45af-9dbf-bd57ac276660'}, 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf'}, '53708fa5-e991-4cd6-86b0-c7ae01cff2dd': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '53708fa5-e991-4cd6-86b0-c7ae01cff2dd'}, '06c8a081-6d50-4a50-ae2d-91c80dcd971b': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '06c8a081-6d50-4a50-ae2d-91c80dcd971b'}, '9f88fae0-50e6-45d4-ada7-fde0c28a61c2': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '9f88fae0-50e6-45d4-ada7-fde0c28a61c2'}, '54d8b678-e4f1-464f-9559-1e8c7079d45b': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '54d8b678-e4f1-464f-9559-1e8c7079d45b'}, 'f0aa125c-2838-41f2-aa54-5f8cc7785e89': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f0aa125c-2838-41f2-aa54-5f8cc7785e89'}, 'a0dda621-eed2-4d6a-b8a5-a902ba434f83': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'a0dda621-eed2-4d6a-b8a5-a902ba434f83'}, 'ee031738-8f04-47d7-b089-c6e06dd4f3c5': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'ee031738-8f04-47d7-b089-c6e06dd4f3c5'}, 'c2c19593-a412-4c02-9bf0-85a87d8754aa': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c2c19593-a412-4c02-9bf0-85a87d8754aa'}, '5ebc6d77-a833-4267-b436-8b2f9fc75b82': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '5ebc6d77-a833-4267-b436-8b2f9fc75b82'}}} (logUtils:54)


Version-Release number of selected component (if applicable):
4.1.1-2

How reproducible:
not clear

Steps to Reproduce:
1. have the following profile: a cluster with 200 nested hosts connected to NFS server.
2. create vms for the nested cluster in bulks of 50.

Actual results:
vms stuck on locked status \ or disk locked status, and engine missed their monitoring results, due to rejected tasks. 

Expected results:
tasks should not be rejected, or at least have some exception handling for fault and recovery handling.

Additional info:

Comment 1 Eldad Marciano 2017-03-06 14:38:06 UTC
Roy, do I missed something ?!

Comment 3 Eldad Marciano 2017-03-06 14:44:14 UTC
I forgot to mention that we have hard coded BlockingQueue size for the quartz thread pool size, that may affect the problem, currently the queue size is 50.
we should make it configurable as we configured the thread pool from the xml.in

moreover, this queue is shared to all quartz threads ?!

Comment 16 Piotr Kliczewski 2017-03-13 08:35:29 UTC
I think that we could attempt to fix it by changing default queue size [1] that is used by the executor. This setting should not increase amount of resource being used and we could empirically check what value would work for our default setting of 250 hosts.


[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/scheduler/src/main/java/org/ovirt/engine/core/utils/timer/SchedulerThreadPool.java#L52

Comment 17 Eldad Marciano 2017-03-13 22:14:23 UTC
(In reply to Piotr Kliczewski from comment #16)
> I think that we could attempt to fix it by changing default queue size [1]
> that is used by the executor. This setting should not increase amount of
> resource being used and we could empirically check what value would work for
> our default setting of 250 hosts.
> 
> 
> [1]
> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/
> scheduler/src/main/java/org/ovirt/engine/core/utils/timer/
> SchedulerThreadPool.java#L52

Right, thats sound reasonable fix.
it will be great if this value will be editable as the qaurtzpool.

also, it will be great to consider retry mechanism for rejected tasks.

Comment 22 Roy Golan 2017-03-14 14:04:45 UTC
The root cause here is that we loose the the minute it is rejected. This means that in case of a task with Fixed-Delay-Job, the listener will never reschedule the job for the next run. This means the system stop doing core jobs. It is a major bug. 

This means that increasing the number of threads is not a viable solution. You don't know when and under which conditions you are going to be rejected. How high is enough? 200, 220, double the amount of hosts?

Handling properly a rejected task means we are able to handle peeks, with the limited resources we have without being affected. Thus bug was introduced once we moved to our implementation of the pool. Every version we have out there that is going to peek the 100 concurrent scheduled tasks is in danger of going zombie.

We must not reject a task if we can schedule it atm.
The original quarz impl never reject a task it seems -

from 'org.quarz.simple.SimpleThreadPool'

 public boolean runInThread(Runnable runnable) {
        if (runnable == null) {
            return false;
        }

        synchronized (nextRunnableLock) {

            handoffPending = true;

            // Wait until a worker thread is available
            while ((availWorkers.size() < 1) && !isShutdown) {
                try {
                    nextRunnableLock.wait(500);
                } catch (InterruptedException ignore) {
                }
            }

            if (!isShutdown) {
                WorkerThread wt = (WorkerThread)availWorkers.removeFirst();
                busyWorkers.add(wt);
                wt.run(runnable);
            } else {
                // If the thread pool is going down, execute the Runnable
                // within a new additional worker thread (no thread from the pool).
                WorkerThread wt = new WorkerThread(this, threadGroup,
                        "WorkerThread-LastJob", prio, isMakeThreadsDaemons(), runnable);
                busyWorkers.add(wt);
                workers.add(wt);
                wt.start();
            }
            nextRunnableLock.notifyAll();
            handoffPending = false;
        }

        return true;
    }

Comment 23 Martin Perina 2017-03-14 20:12:21 UTC
Piotr, could you please check with fakevdsm if reverting https://gerrit.ovirt.org/48667 will fix the issue? I mean if engine will be able to handle 250 fakevdsm hosts without increasing Quartz thread pool size?

Comment 24 Roy Golan 2017-03-15 08:01:15 UTC
It's enough to change the ovirt-engine.xml.in:

 <property name="org.quartz.threadPool.class" value="org.quartz.simpl.SimpleThreadPool"/>

With this setup my system works with pool of 30 and 50 hosts. Futher tests need to be done.

Comment 25 Roy Golan 2017-03-15 08:02:18 UTC
Adding masayag and laravot who possibly saw COCO task dropping in some environments apparantely

Comment 26 Piotr Kliczewski 2017-03-15 08:17:49 UTC
Martin,

The idea behind using ThreadPoolExecutor was that it is more efficient in therms of using resources when underutilized. If not used the threads are killed and created if required (slow start). This pool is more efficient. Using quartz provided pool would make it work again but we loose flexibility of executor. Configuring queue size and making sure that it is big enough for 250 hosts should be much better solution in therms of resource management.

Comment 27 Oved Ourfali 2017-03-15 08:20:15 UTC
(In reply to Piotr Kliczewski from comment #26)
> Martin,
> 
> The idea behind using ThreadPoolExecutor was that it is more efficient in
> therms of using resources when underutilized. If not used the threads are
> killed and created if required (slow start). This pool is more efficient.
> Using quartz provided pool would make it work again but we loose flexibility
> of executor. Configuring queue size and making sure that it is big enough
> for 250 hosts should be much better solution in therms of resource
> management.

In what way is it more efficient?
Less memory resources? Anything else?

Comment 28 Piotr Kliczewski 2017-03-15 09:00:09 UTC
(In reply to Oved Ourfali from comment #27)
> (In reply to Piotr Kliczewski from comment #26)
> > Martin,
> > 
> > The idea behind using ThreadPoolExecutor was that it is more efficient in
> > therms of using resources when underutilized. If not used the threads are
> > killed and created if required (slow start). This pool is more efficient.
> > Using quartz provided pool would make it work again but we loose flexibility
> > of executor. Configuring queue size and making sure that it is big enough
> > for 250 hosts should be much better solution in therms of resource
> > management.
> 
> In what way is it more efficient?
> Less memory resources? Anything else?

As I described above the # of threads being run is not constant and it adjusted for current utilization pattern so it would mean more optimal memory and cpu utilization. We just make sure that waiting queue is big enough so the tasks are not dropped. We already had a patch for it and we need to empirically check correct size for 250 hosts (our default limit).

Comment 29 Oved Ourfali 2017-03-15 09:12:13 UTC
We can test that as well.
However, I haven't seen evidence that there are CPU/RAM issues to begin with, so not sure it is worth it.

Comment 30 Martin Perina 2017-03-15 20:35:55 UTC
(In reply to Roy Golan from comment #22)
> The root cause here is that we loose the the minute it is rejected. This
> means that in case of a task with Fixed-Delay-Job, the listener will never
> reschedule the job for the next run. This means the system stop doing core
> jobs. It is a major bug. 
> 
> This means that increasing the number of threads is not a viable solution.
> You don't know when and under which conditions you are going to be rejected.
> How high is enough? 200, 220, double the amount of hosts?
> 
> Handling properly a rejected task means we are able to handle peeks, with
> the limited resources we have without being affected. Thus bug was
> introduced once we moved to our implementation of the pool. Every version we
> have out there that is going to peek the 100 concurrent scheduled tasks is
> in danger of going zombie.
> 
> We must not reject a task if we can schedule it atm.
> The original quarz impl never reject a task it seems -

After discussing your findings with Piotr we have decided to continue with current patch, which increases queue size for tasks when thread pool is exhausted:

1. ThreadPoolExecutor is part of JDK and as this it's recommended solution (and we are going to move all our pools to use standard executors service anyway in 4.2)

2. Part of ThreadPoolExecutor is a definition of queue to hold tasks when thread pool doesn't have available threads. This is trying to solve the same issue in a different way than by resubmitting task as in original Quartz thread pool. We would prefer to use recommended solution rather then combining it with resubmitting.

3. Also ThreadPoolExecutor is able to manager threads in a smart way, it allocates thread when needed and remove them when not needed anymore (original Quartz thread pool allocated all threads on startup and hold them when they were not needed). Assuming that most of engine users doesn't have more than 100 hosts, this allows us to lower number of threads for most setups and improve performance by allocating more CPU time for other "useful" threads.

Comment 31 Moti Asayag 2017-03-16 07:05:13 UTC
(In reply to Roy Golan from comment #25)
> Adding masayag and laravot who possibly saw COCO task
> dropping in some environments apparantely

The same symptoms appeared in Bug 1417940, where it seems that CoCo task wasn't re-scheduling from a certain point of time.

A bug in Quartz was raised as a possible cause.

Comment 32 Roy Golan 2017-03-16 08:19:15 UTC
Martin, the proposed patch doesn't address the core issue of rejection. It is not a solution. at the moment simply reverting is a viable solution (and masayag only adding is enforcing this assumption). The proposed solution will only hide the problem and you never know when again this will occur.

Comment 33 Oved Ourfali 2017-03-16 08:25:20 UTC
We plan a rework for that for 4.2.
We will decide whether to revert or use the solution proposed above for 4.1.
Anyhow, the solution needs to be carefully tested both by us and QE.

Comment 34 Martin Perina 2017-03-16 12:46:43 UTC
After long discussion we propose following solution:

1. Add configuration option to engine.conf.d, which will allow to specify which thread pool implementation we will use (either the default Quartz or the new one), the default will be original Quartz -> that should fix any issues around task rejections and reflect 3.6 behaviour

2. Add configuration options to engine.conf.d, which will allow to specify thread pool size (for both Quartz and new one) and queue size (only for new one) -> this will allow us to change sizing wihtout the need to edit ovirt-engine.xml.in which is overriden on each upgrade

Comment 37 Eldad Marciano 2017-03-21 10:39:17 UTC
heads up, I patched my setup with the posted patch.
currently we can't identify any task rejection or similar issues. 

we'll verify it, once it will move to "on_qa"

Comment 38 Eldad Marciano 2017-03-21 10:47:14 UTC
(In reply to Eldad Marciano from comment #37)
> heads up, I patched my setup with the posted patch.
> currently we can't identify any task rejection or similar issues. 
> 
> we'll verify it, once it will move to "on_qa"

engine config:

ENGINE_QUARTZ_POOL_CLASS=org.ovirt.engine.core.utils.timer.SchedulerThreadPool
ENGINE_QUARTZ_POOL_THREAD_SIZE=100
ENGINE_QUARTZ_POOL_QUEUE_SIZE=350
ENGINE_QUARTZ_POOL_REJECT_POLICY=true

error rate:
==== engine error rate ====
JdbcConnectionException - 0
Tasks rejected - 0
heartbeat exceeded - 0
VM not responding - 0
Thread interrupts - 0

currently we verify it by:
- engine restart
- switching 47 hosts to maintenance and back to active.

Comment 39 Martin Perina 2017-03-23 20:58:49 UTC
Sorry, moved to MODIFIED by mistake

Comment 40 eberman 2017-04-12 08:30:06 UTC
(In reply to Eldad Marciano from comment #0)
> Description of problem:
> 
> In part of scale out nested vms, some tasks were rejected and nothing will
> happen since than.
> specially for create vms or disks.
> 
> we used the api to trigger vm creation and we found plenty of vms in locked
> state forever, since the engine missed their monitoring task results.
> 
> for instance let have a look on the engine output, we have plenty of:
> 
> 2017-03-06 00:13:39,777+02 ERROR
> [org.ovirt.engine.core.utils.timer.SchedulerThreadPool]
> (DefaultQuartzScheduler_QuartzSchedulerThread) [] Task rejected due to: :
> java.util.concurrent.RejectedExecutionException: Task
> java.util.concurrent.FutureTask@cd2fd03 rejected from
> java.util.concurrent.ThreadPoolExecutor@37a79d5e[Running, pool size = 250,
> active threads = 250, queued tasks = 50, completed tasks = 250]
> 
> and tasks which wait for complete status.
> 2017-03-06 11:46:24,734+02 INFO 
> [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
> (DefaultQuartzScheduler222) [7caad81a] Command 'AddVm' (id:
> '31c2ae20-115a-4c34-b811-defa3bf118e6') waiting on child command id:
> '2a3dc351-602e-434b-87a1-5594ad41bbec' type:'CreateSnapshotFromTemplate' to
> complete
> 
> 
> In time vdsm (SPM) reports that all tasks completed:
> 
> 017-03-06 15:36:28,637+0200 INFO  (jsonrpc/0) [dispatcher] Run and protect:
> getAllTasksStatuses, Return response: {'allTasksStatus':
> {'e0c32d8d-5691-4d9a-aedf-c063f9343f3c': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'e0c32d8d-5691-4d9a-aedf-c063f9343f3c'},
> '10ecbc6b-c13d-4533-aeda-255b87104026': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '10ecbc6b-c13d-4533-aeda-255b87104026'},
> '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df'},
> 'c0793e9f-6090-4251-a062-50baa0bf2cc1': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'c0793e9f-6090-4251-a062-50baa0bf2cc1'},
> '446ae593-6e99-41bc-a5c0-f2d453cfed63': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '446ae593-6e99-41bc-a5c0-f2d453cfed63'},
> 'b8343b61-680e-4718-b1f0-4b96a2198cbd': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'b8343b61-680e-4718-b1f0-4b96a2198cbd'},
> '3a5355b9-0d5d-4bc1-92d1-32902f704e98': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '3a5355b9-0d5d-4bc1-92d1-32902f704e98'},
> 'f7788864-9887-4540-83cd-079ed5c76be6': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'f7788864-9887-4540-83cd-079ed5c76be6'},
> 'cabbd9c9-2788-4b25-9332-e099619df80a': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'cabbd9c9-2788-4b25-9332-e099619df80a'},
> '83bf0607-2567-4a54-8f46-711ca197ec51': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '83bf0607-2567-4a54-8f46-711ca197ec51'},
> 'd4775ebe-be2b-4259-8dbd-1c115d885887': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'd4775ebe-be2b-4259-8dbd-1c115d885887'},
> '8e2056eb-a138-47ae-9bde-de02e283d8d8': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '8e2056eb-a138-47ae-9bde-de02e283d8d8'},
> '58dc37e2-3504-4816-929e-84d70a4e2253': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '58dc37e2-3504-4816-929e-84d70a4e2253'},
> '9f10c0de-4964-4d89-977e-5e70e5d1f298': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '9f10c0de-4964-4d89-977e-5e70e5d1f298'},
> '1f969cce-3100-4627-8dac-f1fce93c03b3': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '1f969cce-3100-4627-8dac-f1fce93c03b3'},
> 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7'},
> '5970e714-ee76-48b0-8509-b67e592baac4': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '5970e714-ee76-48b0-8509-b67e592baac4'},
> '4cc75096-e02e-4b8d-98e9-0749bf484450': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '4cc75096-e02e-4b8d-98e9-0749bf484450'},
> '12abf7d1-6df5-4d23-8c8a-b7c550bef406': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '12abf7d1-6df5-4d23-8c8a-b7c550bef406'},
> '9510ff20-283a-4c4e-a67b-91faade156e2': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '9510ff20-283a-4c4e-a67b-91faade156e2'},
> '51fcb0be-6f95-4f48-8007-69411fe45abb': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '51fcb0be-6f95-4f48-8007-69411fe45abb'},
> '015acaca-4928-4060-abc0-6c7d49ee1412': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '015acaca-4928-4060-abc0-6c7d49ee1412'},
> '1aca91b7-7c60-4728-943b-840435231992': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '1aca91b7-7c60-4728-943b-840435231992'},
> '04cbd940-69ed-4f18-a646-8c742182a43d': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '04cbd940-69ed-4f18-a646-8c742182a43d'},
> 'dc9ffe80-4214-4c97-ae0c-feaf781f4190': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'dc9ffe80-4214-4c97-ae0c-feaf781f4190'},
> '0ec928d6-fed1-413e-ae64-102b5551ffd6': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '0ec928d6-fed1-413e-ae64-102b5551ffd6'},
> '5d16de2f-3e60-43f7-aa38-d38f683dd247': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '5d16de2f-3e60-43f7-aa38-d38f683dd247'},
> 'dc777dca-5a89-41b3-9132-d06298abdf66': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'dc777dca-5a89-41b3-9132-d06298abdf66'},
> '347333ce-f463-49e2-b8b4-58291a1d1e71': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '347333ce-f463-49e2-b8b4-58291a1d1e71'},
> '89b31229-d880-4f7f-b0d6-79c0f9b584c2': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '89b31229-d880-4f7f-b0d6-79c0f9b584c2'},
> '4ae345db-09cb-4455-a94a-e5f54dba7dc3': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '4ae345db-09cb-4455-a94a-e5f54dba7dc3'},
> 'e47c0497-65f6-472c-827b-158c59c74aee': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'e47c0497-65f6-472c-827b-158c59c74aee'},
> '71d4aafa-ac8b-430e-b8f6-485e4ffde67d': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '71d4aafa-ac8b-430e-b8f6-485e4ffde67d'},
> '00744921-cf79-44db-84a8-c195e3cb3d88': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '00744921-cf79-44db-84a8-c195e3cb3d88'},
> 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12'},
> 'adb119f2-2731-441a-8c5f-5d6feffaa628': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'adb119f2-2731-441a-8c5f-5d6feffaa628'},
> 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388'},
> 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb'},
> '1a557b8c-2c0e-4979-9220-10064724dad9': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '1a557b8c-2c0e-4979-9220-10064724dad9'},
> 'e337a0c9-5c34-4b92-894c-58ab90656464': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'e337a0c9-5c34-4b92-894c-58ab90656464'},
> 'eef8223f-7838-46c3-b27d-56ee621a65a9': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'eef8223f-7838-46c3-b27d-56ee621a65a9'},
> '88594424-25cc-4224-9d8e-1ad715d25d4c': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '88594424-25cc-4224-9d8e-1ad715d25d4c'},
> 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e'},
> '2d142116-a062-45af-9dbf-bd57ac276660': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '2d142116-a062-45af-9dbf-bd57ac276660'},
> 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf'},
> '53708fa5-e991-4cd6-86b0-c7ae01cff2dd': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '53708fa5-e991-4cd6-86b0-c7ae01cff2dd'},
> '06c8a081-6d50-4a50-ae2d-91c80dcd971b': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '06c8a081-6d50-4a50-ae2d-91c80dcd971b'},
> '9f88fae0-50e6-45d4-ada7-fde0c28a61c2': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '9f88fae0-50e6-45d4-ada7-fde0c28a61c2'},
> '54d8b678-e4f1-464f-9559-1e8c7079d45b': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '54d8b678-e4f1-464f-9559-1e8c7079d45b'},
> 'f0aa125c-2838-41f2-aa54-5f8cc7785e89': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'f0aa125c-2838-41f2-aa54-5f8cc7785e89'},
> 'a0dda621-eed2-4d6a-b8a5-a902ba434f83': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'a0dda621-eed2-4d6a-b8a5-a902ba434f83'},
> 'ee031738-8f04-47d7-b089-c6e06dd4f3c5': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'ee031738-8f04-47d7-b089-c6e06dd4f3c5'},
> 'c2c19593-a412-4c02-9bf0-85a87d8754aa': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': 'c2c19593-a412-4c02-9bf0-85a87d8754aa'},
> '5ebc6d77-a833-4267-b436-8b2f9fc75b82': {'code': 0, 'message': '1 jobs
> completed successfully', 'taskState': 'finished', 'taskResult': 'success',
> 'taskID': '5ebc6d77-a833-4267-b436-8b2f9fc75b82'}}} (logUtils:54)
> 
> 
> Version-Release number of selected component (if applicable):
> 4.1.1-2
> 
> How reproducible:
> not clear
> 
> Steps to Reproduce:
> 1. have the following profile: a cluster with 200 nested hosts connected to
> NFS server.
> 2. create vms for the nested cluster in bulks of 50.
> 
> Actual results:
> vms stuck on locked status \ or disk locked status, and engine missed their
> monitoring results, due to rejected tasks. 
> 
> Expected results:
> tasks should not be rejected, or at least have some exception handling for
> fault and recovery handling.
> 
> Additional info:

currently tried finding the same error that started this issue :according to :Task rejected due to ,,,
xzgrep "Task rejected due to" /var/log/ovirt-engine/engine.log*

4.1.1.2-0.1.el7

i dont see it with 
currently 250 host (100 are up) + ~1300 Vms up and based on this verification in comment 37 - moving to verify 

(In reply to Eldad Marciano from comment #38)
> (In reply to Eldad Marciano from comment #37)
> > heads up, I patched my setup with the posted patch.
> > currently we can't identify any task rejection or similar issues. 
> > 
> > we'll verify it, once it will move to "on_qa"
> 
> engine config:
> 
> ENGINE_QUARTZ_POOL_CLASS=org.ovirt.engine.core.utils.timer.
> SchedulerThreadPool
> ENGINE_QUARTZ_POOL_THREAD_SIZE=100
> ENGINE_QUARTZ_POOL_QUEUE_SIZE=350
> ENGINE_QUARTZ_POOL_REJECT_POLICY=true
> 
> error rate:
> ==== engine error rate ====
> JdbcConnectionException - 0
> Tasks rejected - 0
> heartbeat exceeded - 0
> VM not responding - 0
> Thread interrupts - 0
> 
> currently we verify it by:
> - engine restart
> - switching 47 hosts to maintenance and back to active.

Comment 41 Allon Mureinik 2017-11-01 10:04:56 UTC
*** Bug 1506373 has been marked as a duplicate of this bug. ***

Comment 42 Martin Perina 2018-04-16 13:18:33 UTC
*** Bug 1417940 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.