Description of problem: In part of scale out nested vms, some tasks were rejected and nothing will happen since than. specially for create vms or disks. we used the api to trigger vm creation and we found plenty of vms in locked state forever, since the engine missed their monitoring task results. for instance let have a look on the engine output, we have plenty of: 2017-03-06 00:13:39,777+02 ERROR [org.ovirt.engine.core.utils.timer.SchedulerThreadPool] (DefaultQuartzScheduler_QuartzSchedulerThread) [] Task rejected due to: : java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@cd2fd03 rejected from java.util.concurrent.ThreadPoolExecutor@37a79d5e[Running, pool size = 250, active threads = 250, queued tasks = 50, completed tasks = 250] and tasks which wait for complete status. 2017-03-06 11:46:24,734+02 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler222) [7caad81a] Command 'AddVm' (id: '31c2ae20-115a-4c34-b811-defa3bf118e6') waiting on child command id: '2a3dc351-602e-434b-87a1-5594ad41bbec' type:'CreateSnapshotFromTemplate' to complete In time vdsm (SPM) reports that all tasks completed: 017-03-06 15:36:28,637+0200 INFO (jsonrpc/0) [dispatcher] Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'e0c32d8d-5691-4d9a-aedf-c063f9343f3c': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e0c32d8d-5691-4d9a-aedf-c063f9343f3c'}, '10ecbc6b-c13d-4533-aeda-255b87104026': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '10ecbc6b-c13d-4533-aeda-255b87104026'}, '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df'}, 'c0793e9f-6090-4251-a062-50baa0bf2cc1': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c0793e9f-6090-4251-a062-50baa0bf2cc1'}, '446ae593-6e99-41bc-a5c0-f2d453cfed63': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '446ae593-6e99-41bc-a5c0-f2d453cfed63'}, 'b8343b61-680e-4718-b1f0-4b96a2198cbd': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'b8343b61-680e-4718-b1f0-4b96a2198cbd'}, '3a5355b9-0d5d-4bc1-92d1-32902f704e98': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '3a5355b9-0d5d-4bc1-92d1-32902f704e98'}, 'f7788864-9887-4540-83cd-079ed5c76be6': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f7788864-9887-4540-83cd-079ed5c76be6'}, 'cabbd9c9-2788-4b25-9332-e099619df80a': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'cabbd9c9-2788-4b25-9332-e099619df80a'}, '83bf0607-2567-4a54-8f46-711ca197ec51': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '83bf0607-2567-4a54-8f46-711ca197ec51'}, 'd4775ebe-be2b-4259-8dbd-1c115d885887': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'd4775ebe-be2b-4259-8dbd-1c115d885887'}, '8e2056eb-a138-47ae-9bde-de02e283d8d8': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '8e2056eb-a138-47ae-9bde-de02e283d8d8'}, '58dc37e2-3504-4816-929e-84d70a4e2253': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '58dc37e2-3504-4816-929e-84d70a4e2253'}, '9f10c0de-4964-4d89-977e-5e70e5d1f298': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '9f10c0de-4964-4d89-977e-5e70e5d1f298'}, '1f969cce-3100-4627-8dac-f1fce93c03b3': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '1f969cce-3100-4627-8dac-f1fce93c03b3'}, 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7'}, '5970e714-ee76-48b0-8509-b67e592baac4': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '5970e714-ee76-48b0-8509-b67e592baac4'}, '4cc75096-e02e-4b8d-98e9-0749bf484450': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '4cc75096-e02e-4b8d-98e9-0749bf484450'}, '12abf7d1-6df5-4d23-8c8a-b7c550bef406': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '12abf7d1-6df5-4d23-8c8a-b7c550bef406'}, '9510ff20-283a-4c4e-a67b-91faade156e2': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '9510ff20-283a-4c4e-a67b-91faade156e2'}, '51fcb0be-6f95-4f48-8007-69411fe45abb': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '51fcb0be-6f95-4f48-8007-69411fe45abb'}, '015acaca-4928-4060-abc0-6c7d49ee1412': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '015acaca-4928-4060-abc0-6c7d49ee1412'}, '1aca91b7-7c60-4728-943b-840435231992': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '1aca91b7-7c60-4728-943b-840435231992'}, '04cbd940-69ed-4f18-a646-8c742182a43d': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '04cbd940-69ed-4f18-a646-8c742182a43d'}, 'dc9ffe80-4214-4c97-ae0c-feaf781f4190': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'dc9ffe80-4214-4c97-ae0c-feaf781f4190'}, '0ec928d6-fed1-413e-ae64-102b5551ffd6': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '0ec928d6-fed1-413e-ae64-102b5551ffd6'}, '5d16de2f-3e60-43f7-aa38-d38f683dd247': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '5d16de2f-3e60-43f7-aa38-d38f683dd247'}, 'dc777dca-5a89-41b3-9132-d06298abdf66': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'dc777dca-5a89-41b3-9132-d06298abdf66'}, '347333ce-f463-49e2-b8b4-58291a1d1e71': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '347333ce-f463-49e2-b8b4-58291a1d1e71'}, '89b31229-d880-4f7f-b0d6-79c0f9b584c2': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '89b31229-d880-4f7f-b0d6-79c0f9b584c2'}, '4ae345db-09cb-4455-a94a-e5f54dba7dc3': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '4ae345db-09cb-4455-a94a-e5f54dba7dc3'}, 'e47c0497-65f6-472c-827b-158c59c74aee': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e47c0497-65f6-472c-827b-158c59c74aee'}, '71d4aafa-ac8b-430e-b8f6-485e4ffde67d': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '71d4aafa-ac8b-430e-b8f6-485e4ffde67d'}, '00744921-cf79-44db-84a8-c195e3cb3d88': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '00744921-cf79-44db-84a8-c195e3cb3d88'}, 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12'}, 'adb119f2-2731-441a-8c5f-5d6feffaa628': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'adb119f2-2731-441a-8c5f-5d6feffaa628'}, 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388'}, 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb'}, '1a557b8c-2c0e-4979-9220-10064724dad9': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '1a557b8c-2c0e-4979-9220-10064724dad9'}, 'e337a0c9-5c34-4b92-894c-58ab90656464': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e337a0c9-5c34-4b92-894c-58ab90656464'}, 'eef8223f-7838-46c3-b27d-56ee621a65a9': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'eef8223f-7838-46c3-b27d-56ee621a65a9'}, '88594424-25cc-4224-9d8e-1ad715d25d4c': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '88594424-25cc-4224-9d8e-1ad715d25d4c'}, 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e'}, '2d142116-a062-45af-9dbf-bd57ac276660': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '2d142116-a062-45af-9dbf-bd57ac276660'}, 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf'}, '53708fa5-e991-4cd6-86b0-c7ae01cff2dd': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '53708fa5-e991-4cd6-86b0-c7ae01cff2dd'}, '06c8a081-6d50-4a50-ae2d-91c80dcd971b': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '06c8a081-6d50-4a50-ae2d-91c80dcd971b'}, '9f88fae0-50e6-45d4-ada7-fde0c28a61c2': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '9f88fae0-50e6-45d4-ada7-fde0c28a61c2'}, '54d8b678-e4f1-464f-9559-1e8c7079d45b': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '54d8b678-e4f1-464f-9559-1e8c7079d45b'}, 'f0aa125c-2838-41f2-aa54-5f8cc7785e89': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'f0aa125c-2838-41f2-aa54-5f8cc7785e89'}, 'a0dda621-eed2-4d6a-b8a5-a902ba434f83': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'a0dda621-eed2-4d6a-b8a5-a902ba434f83'}, 'ee031738-8f04-47d7-b089-c6e06dd4f3c5': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'ee031738-8f04-47d7-b089-c6e06dd4f3c5'}, 'c2c19593-a412-4c02-9bf0-85a87d8754aa': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c2c19593-a412-4c02-9bf0-85a87d8754aa'}, '5ebc6d77-a833-4267-b436-8b2f9fc75b82': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '5ebc6d77-a833-4267-b436-8b2f9fc75b82'}}} (logUtils:54) Version-Release number of selected component (if applicable): 4.1.1-2 How reproducible: not clear Steps to Reproduce: 1. have the following profile: a cluster with 200 nested hosts connected to NFS server. 2. create vms for the nested cluster in bulks of 50. Actual results: vms stuck on locked status \ or disk locked status, and engine missed their monitoring results, due to rejected tasks. Expected results: tasks should not be rejected, or at least have some exception handling for fault and recovery handling. Additional info:
Roy, do I missed something ?!
I forgot to mention that we have hard coded BlockingQueue size for the quartz thread pool size, that may affect the problem, currently the queue size is 50. we should make it configurable as we configured the thread pool from the xml.in moreover, this queue is shared to all quartz threads ?!
I think that we could attempt to fix it by changing default queue size [1] that is used by the executor. This setting should not increase amount of resource being used and we could empirically check what value would work for our default setting of 250 hosts. [1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/scheduler/src/main/java/org/ovirt/engine/core/utils/timer/SchedulerThreadPool.java#L52
(In reply to Piotr Kliczewski from comment #16) > I think that we could attempt to fix it by changing default queue size [1] > that is used by the executor. This setting should not increase amount of > resource being used and we could empirically check what value would work for > our default setting of 250 hosts. > > > [1] > https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/ > scheduler/src/main/java/org/ovirt/engine/core/utils/timer/ > SchedulerThreadPool.java#L52 Right, thats sound reasonable fix. it will be great if this value will be editable as the qaurtzpool. also, it will be great to consider retry mechanism for rejected tasks.
The root cause here is that we loose the the minute it is rejected. This means that in case of a task with Fixed-Delay-Job, the listener will never reschedule the job for the next run. This means the system stop doing core jobs. It is a major bug. This means that increasing the number of threads is not a viable solution. You don't know when and under which conditions you are going to be rejected. How high is enough? 200, 220, double the amount of hosts? Handling properly a rejected task means we are able to handle peeks, with the limited resources we have without being affected. Thus bug was introduced once we moved to our implementation of the pool. Every version we have out there that is going to peek the 100 concurrent scheduled tasks is in danger of going zombie. We must not reject a task if we can schedule it atm. The original quarz impl never reject a task it seems - from 'org.quarz.simple.SimpleThreadPool' public boolean runInThread(Runnable runnable) { if (runnable == null) { return false; } synchronized (nextRunnableLock) { handoffPending = true; // Wait until a worker thread is available while ((availWorkers.size() < 1) && !isShutdown) { try { nextRunnableLock.wait(500); } catch (InterruptedException ignore) { } } if (!isShutdown) { WorkerThread wt = (WorkerThread)availWorkers.removeFirst(); busyWorkers.add(wt); wt.run(runnable); } else { // If the thread pool is going down, execute the Runnable // within a new additional worker thread (no thread from the pool). WorkerThread wt = new WorkerThread(this, threadGroup, "WorkerThread-LastJob", prio, isMakeThreadsDaemons(), runnable); busyWorkers.add(wt); workers.add(wt); wt.start(); } nextRunnableLock.notifyAll(); handoffPending = false; } return true; }
Piotr, could you please check with fakevdsm if reverting https://gerrit.ovirt.org/48667 will fix the issue? I mean if engine will be able to handle 250 fakevdsm hosts without increasing Quartz thread pool size?
It's enough to change the ovirt-engine.xml.in: <property name="org.quartz.threadPool.class" value="org.quartz.simpl.SimpleThreadPool"/> With this setup my system works with pool of 30 and 50 hosts. Futher tests need to be done.
Adding masayag and laravot who possibly saw COCO task dropping in some environments apparantely
Martin, The idea behind using ThreadPoolExecutor was that it is more efficient in therms of using resources when underutilized. If not used the threads are killed and created if required (slow start). This pool is more efficient. Using quartz provided pool would make it work again but we loose flexibility of executor. Configuring queue size and making sure that it is big enough for 250 hosts should be much better solution in therms of resource management.
(In reply to Piotr Kliczewski from comment #26) > Martin, > > The idea behind using ThreadPoolExecutor was that it is more efficient in > therms of using resources when underutilized. If not used the threads are > killed and created if required (slow start). This pool is more efficient. > Using quartz provided pool would make it work again but we loose flexibility > of executor. Configuring queue size and making sure that it is big enough > for 250 hosts should be much better solution in therms of resource > management. In what way is it more efficient? Less memory resources? Anything else?
(In reply to Oved Ourfali from comment #27) > (In reply to Piotr Kliczewski from comment #26) > > Martin, > > > > The idea behind using ThreadPoolExecutor was that it is more efficient in > > therms of using resources when underutilized. If not used the threads are > > killed and created if required (slow start). This pool is more efficient. > > Using quartz provided pool would make it work again but we loose flexibility > > of executor. Configuring queue size and making sure that it is big enough > > for 250 hosts should be much better solution in therms of resource > > management. > > In what way is it more efficient? > Less memory resources? Anything else? As I described above the # of threads being run is not constant and it adjusted for current utilization pattern so it would mean more optimal memory and cpu utilization. We just make sure that waiting queue is big enough so the tasks are not dropped. We already had a patch for it and we need to empirically check correct size for 250 hosts (our default limit).
We can test that as well. However, I haven't seen evidence that there are CPU/RAM issues to begin with, so not sure it is worth it.
(In reply to Roy Golan from comment #22) > The root cause here is that we loose the the minute it is rejected. This > means that in case of a task with Fixed-Delay-Job, the listener will never > reschedule the job for the next run. This means the system stop doing core > jobs. It is a major bug. > > This means that increasing the number of threads is not a viable solution. > You don't know when and under which conditions you are going to be rejected. > How high is enough? 200, 220, double the amount of hosts? > > Handling properly a rejected task means we are able to handle peeks, with > the limited resources we have without being affected. Thus bug was > introduced once we moved to our implementation of the pool. Every version we > have out there that is going to peek the 100 concurrent scheduled tasks is > in danger of going zombie. > > We must not reject a task if we can schedule it atm. > The original quarz impl never reject a task it seems - After discussing your findings with Piotr we have decided to continue with current patch, which increases queue size for tasks when thread pool is exhausted: 1. ThreadPoolExecutor is part of JDK and as this it's recommended solution (and we are going to move all our pools to use standard executors service anyway in 4.2) 2. Part of ThreadPoolExecutor is a definition of queue to hold tasks when thread pool doesn't have available threads. This is trying to solve the same issue in a different way than by resubmitting task as in original Quartz thread pool. We would prefer to use recommended solution rather then combining it with resubmitting. 3. Also ThreadPoolExecutor is able to manager threads in a smart way, it allocates thread when needed and remove them when not needed anymore (original Quartz thread pool allocated all threads on startup and hold them when they were not needed). Assuming that most of engine users doesn't have more than 100 hosts, this allows us to lower number of threads for most setups and improve performance by allocating more CPU time for other "useful" threads.
(In reply to Roy Golan from comment #25) > Adding masayag and laravot who possibly saw COCO task > dropping in some environments apparantely The same symptoms appeared in Bug 1417940, where it seems that CoCo task wasn't re-scheduling from a certain point of time. A bug in Quartz was raised as a possible cause.
Martin, the proposed patch doesn't address the core issue of rejection. It is not a solution. at the moment simply reverting is a viable solution (and masayag only adding is enforcing this assumption). The proposed solution will only hide the problem and you never know when again this will occur.
We plan a rework for that for 4.2. We will decide whether to revert or use the solution proposed above for 4.1. Anyhow, the solution needs to be carefully tested both by us and QE.
After long discussion we propose following solution: 1. Add configuration option to engine.conf.d, which will allow to specify which thread pool implementation we will use (either the default Quartz or the new one), the default will be original Quartz -> that should fix any issues around task rejections and reflect 3.6 behaviour 2. Add configuration options to engine.conf.d, which will allow to specify thread pool size (for both Quartz and new one) and queue size (only for new one) -> this will allow us to change sizing wihtout the need to edit ovirt-engine.xml.in which is overriden on each upgrade
heads up, I patched my setup with the posted patch. currently we can't identify any task rejection or similar issues. we'll verify it, once it will move to "on_qa"
(In reply to Eldad Marciano from comment #37) > heads up, I patched my setup with the posted patch. > currently we can't identify any task rejection or similar issues. > > we'll verify it, once it will move to "on_qa" engine config: ENGINE_QUARTZ_POOL_CLASS=org.ovirt.engine.core.utils.timer.SchedulerThreadPool ENGINE_QUARTZ_POOL_THREAD_SIZE=100 ENGINE_QUARTZ_POOL_QUEUE_SIZE=350 ENGINE_QUARTZ_POOL_REJECT_POLICY=true error rate: ==== engine error rate ==== JdbcConnectionException - 0 Tasks rejected - 0 heartbeat exceeded - 0 VM not responding - 0 Thread interrupts - 0 currently we verify it by: - engine restart - switching 47 hosts to maintenance and back to active.
Sorry, moved to MODIFIED by mistake
(In reply to Eldad Marciano from comment #0) > Description of problem: > > In part of scale out nested vms, some tasks were rejected and nothing will > happen since than. > specially for create vms or disks. > > we used the api to trigger vm creation and we found plenty of vms in locked > state forever, since the engine missed their monitoring task results. > > for instance let have a look on the engine output, we have plenty of: > > 2017-03-06 00:13:39,777+02 ERROR > [org.ovirt.engine.core.utils.timer.SchedulerThreadPool] > (DefaultQuartzScheduler_QuartzSchedulerThread) [] Task rejected due to: : > java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.FutureTask@cd2fd03 rejected from > java.util.concurrent.ThreadPoolExecutor@37a79d5e[Running, pool size = 250, > active threads = 250, queued tasks = 50, completed tasks = 250] > > and tasks which wait for complete status. > 2017-03-06 11:46:24,734+02 INFO > [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] > (DefaultQuartzScheduler222) [7caad81a] Command 'AddVm' (id: > '31c2ae20-115a-4c34-b811-defa3bf118e6') waiting on child command id: > '2a3dc351-602e-434b-87a1-5594ad41bbec' type:'CreateSnapshotFromTemplate' to > complete > > > In time vdsm (SPM) reports that all tasks completed: > > 017-03-06 15:36:28,637+0200 INFO (jsonrpc/0) [dispatcher] Run and protect: > getAllTasksStatuses, Return response: {'allTasksStatus': > {'e0c32d8d-5691-4d9a-aedf-c063f9343f3c': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'e0c32d8d-5691-4d9a-aedf-c063f9343f3c'}, > '10ecbc6b-c13d-4533-aeda-255b87104026': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '10ecbc6b-c13d-4533-aeda-255b87104026'}, > '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '2dc4e4b0-85d2-4700-8f09-bdd87ff0c2df'}, > 'c0793e9f-6090-4251-a062-50baa0bf2cc1': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'c0793e9f-6090-4251-a062-50baa0bf2cc1'}, > '446ae593-6e99-41bc-a5c0-f2d453cfed63': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '446ae593-6e99-41bc-a5c0-f2d453cfed63'}, > 'b8343b61-680e-4718-b1f0-4b96a2198cbd': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'b8343b61-680e-4718-b1f0-4b96a2198cbd'}, > '3a5355b9-0d5d-4bc1-92d1-32902f704e98': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '3a5355b9-0d5d-4bc1-92d1-32902f704e98'}, > 'f7788864-9887-4540-83cd-079ed5c76be6': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'f7788864-9887-4540-83cd-079ed5c76be6'}, > 'cabbd9c9-2788-4b25-9332-e099619df80a': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'cabbd9c9-2788-4b25-9332-e099619df80a'}, > '83bf0607-2567-4a54-8f46-711ca197ec51': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '83bf0607-2567-4a54-8f46-711ca197ec51'}, > 'd4775ebe-be2b-4259-8dbd-1c115d885887': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'd4775ebe-be2b-4259-8dbd-1c115d885887'}, > '8e2056eb-a138-47ae-9bde-de02e283d8d8': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '8e2056eb-a138-47ae-9bde-de02e283d8d8'}, > '58dc37e2-3504-4816-929e-84d70a4e2253': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '58dc37e2-3504-4816-929e-84d70a4e2253'}, > '9f10c0de-4964-4d89-977e-5e70e5d1f298': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '9f10c0de-4964-4d89-977e-5e70e5d1f298'}, > '1f969cce-3100-4627-8dac-f1fce93c03b3': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '1f969cce-3100-4627-8dac-f1fce93c03b3'}, > 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'fabb7f2b-09ed-4c6f-a737-fe9999d35eb7'}, > '5970e714-ee76-48b0-8509-b67e592baac4': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '5970e714-ee76-48b0-8509-b67e592baac4'}, > '4cc75096-e02e-4b8d-98e9-0749bf484450': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '4cc75096-e02e-4b8d-98e9-0749bf484450'}, > '12abf7d1-6df5-4d23-8c8a-b7c550bef406': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '12abf7d1-6df5-4d23-8c8a-b7c550bef406'}, > '9510ff20-283a-4c4e-a67b-91faade156e2': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '9510ff20-283a-4c4e-a67b-91faade156e2'}, > '51fcb0be-6f95-4f48-8007-69411fe45abb': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '51fcb0be-6f95-4f48-8007-69411fe45abb'}, > '015acaca-4928-4060-abc0-6c7d49ee1412': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '015acaca-4928-4060-abc0-6c7d49ee1412'}, > '1aca91b7-7c60-4728-943b-840435231992': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '1aca91b7-7c60-4728-943b-840435231992'}, > '04cbd940-69ed-4f18-a646-8c742182a43d': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '04cbd940-69ed-4f18-a646-8c742182a43d'}, > 'dc9ffe80-4214-4c97-ae0c-feaf781f4190': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'dc9ffe80-4214-4c97-ae0c-feaf781f4190'}, > '0ec928d6-fed1-413e-ae64-102b5551ffd6': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '0ec928d6-fed1-413e-ae64-102b5551ffd6'}, > '5d16de2f-3e60-43f7-aa38-d38f683dd247': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '5d16de2f-3e60-43f7-aa38-d38f683dd247'}, > 'dc777dca-5a89-41b3-9132-d06298abdf66': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'dc777dca-5a89-41b3-9132-d06298abdf66'}, > '347333ce-f463-49e2-b8b4-58291a1d1e71': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '347333ce-f463-49e2-b8b4-58291a1d1e71'}, > '89b31229-d880-4f7f-b0d6-79c0f9b584c2': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '89b31229-d880-4f7f-b0d6-79c0f9b584c2'}, > '4ae345db-09cb-4455-a94a-e5f54dba7dc3': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '4ae345db-09cb-4455-a94a-e5f54dba7dc3'}, > 'e47c0497-65f6-472c-827b-158c59c74aee': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'e47c0497-65f6-472c-827b-158c59c74aee'}, > '71d4aafa-ac8b-430e-b8f6-485e4ffde67d': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '71d4aafa-ac8b-430e-b8f6-485e4ffde67d'}, > '00744921-cf79-44db-84a8-c195e3cb3d88': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '00744921-cf79-44db-84a8-c195e3cb3d88'}, > 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'f380c2e8-ef80-4cc5-a5bf-227eac93ac12'}, > 'adb119f2-2731-441a-8c5f-5d6feffaa628': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'adb119f2-2731-441a-8c5f-5d6feffaa628'}, > 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'f6a11d78-c29b-413d-b3ef-cd95cf4f2388'}, > 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'b9d8a1b6-59f9-48d1-a585-53d8c6cb39fb'}, > '1a557b8c-2c0e-4979-9220-10064724dad9': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '1a557b8c-2c0e-4979-9220-10064724dad9'}, > 'e337a0c9-5c34-4b92-894c-58ab90656464': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'e337a0c9-5c34-4b92-894c-58ab90656464'}, > 'eef8223f-7838-46c3-b27d-56ee621a65a9': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'eef8223f-7838-46c3-b27d-56ee621a65a9'}, > '88594424-25cc-4224-9d8e-1ad715d25d4c': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '88594424-25cc-4224-9d8e-1ad715d25d4c'}, > 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'c4cb4f2e-8caf-47f7-82f8-8b15d2f6502e'}, > '2d142116-a062-45af-9dbf-bd57ac276660': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '2d142116-a062-45af-9dbf-bd57ac276660'}, > 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'e0e4b4f0-44c7-4577-8098-a4a4111e43bf'}, > '53708fa5-e991-4cd6-86b0-c7ae01cff2dd': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '53708fa5-e991-4cd6-86b0-c7ae01cff2dd'}, > '06c8a081-6d50-4a50-ae2d-91c80dcd971b': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '06c8a081-6d50-4a50-ae2d-91c80dcd971b'}, > '9f88fae0-50e6-45d4-ada7-fde0c28a61c2': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '9f88fae0-50e6-45d4-ada7-fde0c28a61c2'}, > '54d8b678-e4f1-464f-9559-1e8c7079d45b': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '54d8b678-e4f1-464f-9559-1e8c7079d45b'}, > 'f0aa125c-2838-41f2-aa54-5f8cc7785e89': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'f0aa125c-2838-41f2-aa54-5f8cc7785e89'}, > 'a0dda621-eed2-4d6a-b8a5-a902ba434f83': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'a0dda621-eed2-4d6a-b8a5-a902ba434f83'}, > 'ee031738-8f04-47d7-b089-c6e06dd4f3c5': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'ee031738-8f04-47d7-b089-c6e06dd4f3c5'}, > 'c2c19593-a412-4c02-9bf0-85a87d8754aa': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': 'c2c19593-a412-4c02-9bf0-85a87d8754aa'}, > '5ebc6d77-a833-4267-b436-8b2f9fc75b82': {'code': 0, 'message': '1 jobs > completed successfully', 'taskState': 'finished', 'taskResult': 'success', > 'taskID': '5ebc6d77-a833-4267-b436-8b2f9fc75b82'}}} (logUtils:54) > > > Version-Release number of selected component (if applicable): > 4.1.1-2 > > How reproducible: > not clear > > Steps to Reproduce: > 1. have the following profile: a cluster with 200 nested hosts connected to > NFS server. > 2. create vms for the nested cluster in bulks of 50. > > Actual results: > vms stuck on locked status \ or disk locked status, and engine missed their > monitoring results, due to rejected tasks. > > Expected results: > tasks should not be rejected, or at least have some exception handling for > fault and recovery handling. > > Additional info: currently tried finding the same error that started this issue :according to :Task rejected due to ,,, xzgrep "Task rejected due to" /var/log/ovirt-engine/engine.log* 4.1.1.2-0.1.el7 i dont see it with currently 250 host (100 are up) + ~1300 Vms up and based on this verification in comment 37 - moving to verify (In reply to Eldad Marciano from comment #38) > (In reply to Eldad Marciano from comment #37) > > heads up, I patched my setup with the posted patch. > > currently we can't identify any task rejection or similar issues. > > > > we'll verify it, once it will move to "on_qa" > > engine config: > > ENGINE_QUARTZ_POOL_CLASS=org.ovirt.engine.core.utils.timer. > SchedulerThreadPool > ENGINE_QUARTZ_POOL_THREAD_SIZE=100 > ENGINE_QUARTZ_POOL_QUEUE_SIZE=350 > ENGINE_QUARTZ_POOL_REJECT_POLICY=true > > error rate: > ==== engine error rate ==== > JdbcConnectionException - 0 > Tasks rejected - 0 > heartbeat exceeded - 0 > VM not responding - 0 > Thread interrupts - 0 > > currently we verify it by: > - engine restart > - switching 47 hosts to maintenance and back to active.
*** Bug 1506373 has been marked as a duplicate of this bug. ***
*** Bug 1417940 has been marked as a duplicate of this bug. ***