Bug 1495723 - Several jobs are not finished
Summary: Several jobs are not finished
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Ravi Nori
QA Contact: Raz Tamir
URL:
Whiteboard:
Duplicates: 1497958
Depends On:
Blocks:
 
Reported: 2017-09-26 11:45 UTC by Raz Tamir
Modified: 2017-12-20 11:38 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-20 11:38:01 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
engine log (367.09 KB, text/plain) - 2017-09-26 11:45 UTC, Raz Tamir
engine log for snapshot preview failure (804.79 KB, application/x-gzip) - 2017-09-27 08:49 UTC, Michael Burman
engine.log (10.45 MB, text/plain) - 2017-09-27 13:04 UTC, Mor
engine log 4.1.7-2 (379.11 KB, application/x-gzip) - 2017-10-01 07:08 UTC, Michael Burman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 82396 0 master MERGED engine : Several jobs are not finished 2017-10-02 19:03:32 UTC

Description Raz Tamir 2017-09-26 11:45:17 UTC
Description of problem:

We see in our automation that sometimes a Remove VM job (/api/jobs) gets stuck on 'started' and never finishes.

The VM itself is removed properly.

2017-09-25 01:53:54,320+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (default task-2) [vms_delete_be936409-fa9a-48ea] BaseAsyncTask::startPollingTask: Starting to poll task 'f5222827-fa6b-4b06-8a55-6ac3dfb03dd2'.
2017-09-25 01:53:54,461+03 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] Lock Acquired to object 'EngineLock:{exclusiveLocks='[d3ae24b2-8f62-4d82-8d7e-00cdadc68776=VM]', sharedLocks=''}'
2017-09-25 01:53:54,477+03 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] Running command: RemoveVmCommand internal: false. Entities affected :  ID: d3ae24b2-8f62-4d82-8d7e-00cdadc68776 Type: VMAction group DELETE_VM with role type USER
2017-09-25 01:53:54,480+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] START, SetVmStatusVDSCommand( SetVmStatusVDSCommandParameters:{vmId='d3ae24b2-8f62-4d82-8d7e-00cdadc68776', status='ImageLocked', exitStatus='Normal'}), log id: 34f3a9ad
2017-09-25 01:53:54,480+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] FINISH, SetVmStatusVDSCommand, log id: 34f3a9ad
2017-09-25 01:53:54,485+03 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] Lock freed to object 'EngineLock:{exclusiveLocks='[d3ae24b2-8f62-4d82-8d7e-00cdadc68776=VM]', sharedLocks=''}'
2017-09-25 01:53:54,501+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.RemoveAllVmImagesCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] Running command: RemoveAllVmImagesCommand internal: true. Entities affected :  ID: d3ae24b2-8f62-4d82-8d7e-00cdadc68776 Type: VM
2017-09-25 01:53:54,508+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] Running command: RemoveImageCommand internal: true. Entities affected :  ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2017-09-25 01:53:54,534+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='873ae459-99ff-4fea-8320-54ad833fb3be', ignoreFailoverLimit='false', storageDomainId='7adee3ea-8c39-4aea-ba65-6122984cc684', imageGroupId='ffaa7588-2d04-4259-8dc8-ee95b4d36efe', postZeros='false', discard='true', forceDelete='false'}), log id: 6e1428d7
2017-09-25 01:53:55,099+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (default task-14) [vms_delete_562ac0eb-82ef-49a1] FINISH, DeleteImageGroupVDSCommand, log id: 6e1428d7
2017-09-25 01:53:55,105+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (default task-14) [vms_delete_562ac0eb-82ef-49a1] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command '771fab80-431f-46c8-be49-1553e19d40ef'
2017-09-25 01:53:55,105+03 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (default task-14) [vms_delete_562ac0eb-82ef-49a1] CommandMultiAsyncTasks::attachTask: Attaching task '63ef7c11-4600-4b1d-81c0-691d8b7eeddb' to command '771fab80-431f-46c8-be49-1553e19d40ef'.
2017-09-25 01:53:55,113+03 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (default task-14) [vms_delete_562ac0eb-82ef-49a1] Adding task '63ef7c11-4600-4b1d-81c0-691d8b7eeddb' (Parent Command 'RemoveVm', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2017-09-25 01:53:55,154+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-14) [vms_delete_562ac0eb-82ef-49a1] EVENT_ID: USER_REMOVE_VM_FINISHED(113), VM vm_type_2501521787 was successfully removed.
2017-09-25 01:53:55,154+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (default task-14) [vms_delete_562ac0eb-82ef-49a1] BaseAsyncTask::startPollingTask: Starting to poll task '63ef7c11-4600-4b1d-81c0-691d8b7eeddb'.
2017-09-25 01:54:00,435+03 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [] Polling and updating Async Tasks: 11 tasks, 4 tasks to poll now
2017-09-25 01:54:00,447+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [] SPMAsyncTask::PollTask: Polling task '04e4cae0-b573-4f00-9a20-2bf1588638b1' (Parent Command 'RemoveVmFromImportExport', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'.


2017-09-25 01:54:00,549+03 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] Ending command 'org.ovirt.engine.core.bll.RemoveVmCommand' successfully.
2017-09-25 01:54:00,550+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-14) [vms_delete_976d6006-7047-4a2b] FINISH, HSMClearTaskVDSCommand, log id: fb439e1
2017-09-25 01:54:00,550+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-14) [vms_delete_976d6006-7047-4a2b] FINISH, SPMClearTaskVDSCommand, log id: 2df2e899
2017-09-25 01:54:00,553+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'RemoveVm' completed, handling the result.
2017-09-25 01:54:00,553+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'RemoveVm' succeeded, clearing tasks.
2017-09-25 01:54:00,553+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] SPMAsyncTask::ClearAsyncTask: Attempting to clear task '63ef7c11-4600-4b1d-81c0-691d8b7eeddb'
2017-09-25 01:54:00,554+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] START, SPMClearTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='873ae459-99ff-4fea-8320-54ad833fb3be', ignoreFailoverLimit='false', taskId='63ef7c11-4600-4b1d-81c0-691d8b7eeddb'}), log id: 571c65f2
2017-09-25 01:54:00,554+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] START, HSMClearTaskVDSCommand(HostName = host_mixed_3, HSMTaskGuidBaseVDSCommandParameters:{hostId='491b2219-ba56-4ef8-a55f-885cb338c2c8', taskId='63ef7c11-4600-4b1d-81c0-691d8b7eeddb'}), log id: 20b3689c
2017-09-25 01:54:00,563+03 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-18) [vms_delete_be936409-fa9a-48ea] Ending command 'org.ovirt.engine.core.bll.RemoveVmCommand' successfully.
2017-09-25 01:54:00,564+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] FINISH, HSMClearTaskVDSCommand, log id: 20b3689c
2017-09-25 01:54:00,564+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] FINISH, SPMClearTaskVDSCommand, log id: 571c65f2
2017-09-25 01:54:00,566+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-18) [vms_delete_be936409-fa9a-48ea] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'RemoveVm' completed, handling the result.
2017-09-25 01:54:00,566+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-18) [vms_delete_be936409-fa9a-48ea] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'RemoveVm' succeeded, clearing tasks.
2017-09-25 01:54:00,566+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-18) [vms_delete_be936409-fa9a-48ea] SPMAsyncTask::ClearAsyncTask: Attempting to clear task 'f5222827-fa6b-4b06-8a55-6ac3dfb03dd2'
2017-09-25 01:54:00,567+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-18) [vms_delete_be936409-fa9a-48ea] START, SPMClearTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='873ae459-99ff-4fea-8320-54ad833fb3be', ignoreFailoverLimit='false', taskId='f5222827-fa6b-4b06-8a55-6ac3dfb03dd2'}), log id: 68425f64
2017-09-25 01:54:00,568+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-18) [vms_delete_be936409-fa9a-48ea] START, HSMClearTaskVDSCommand(HostName = host_mixed_3, HSMTaskGuidBaseVDSCommandParameters:{hostId='491b2219-ba56-4ef8-a55f-885cb338c2c8', taskId='f5222827-fa6b-4b06-8a55-6ac3dfb03dd2'}), log id: 66c683ea
2017-09-25 01:54:00,570+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-9) [vms_delete_ac2c6a35-0515-4555] BaseAsyncTask::removeTaskFromDB: Removed task '04e4cae0-b573-4f00-9a20-2bf1588638b1' from DataBase
2017-09-25 01:54:00,570+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-9) [vms_delete_ac2c6a35-0515-4555] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '0e252c14-0c72-4b0e-9a02-5baf5a38ff6b'
2017-09-25 01:54:00,572+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-14) [vms_delete_976d6006-7047-4a2b] BaseAsyncTask::removeTaskFromDB: Removed task '6b0015e7-01fb-4979-9f61-98b23f2f071d' from DataBase
2017-09-25 01:54:00,572+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-14) [vms_delete_976d6006-7047-4a2b] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '0c9d8bbe-14f7-4ad8-9cc9-9cd975a58a48'
2017-09-25 01:54:00,575+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] BaseAsyncTask::removeTaskFromDB: Removed task '63ef7c11-4600-4b1d-81c0-691d8b7eeddb' from DataBase
2017-09-25 01:54:00,575+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-11) [vms_delete_562ac0eb-82ef-49a1] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '771fab80-431f-46c8-be49-1553e19d40ef'



Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170921184504.gitfcfc9a7.el7.centos

How reproducible:
Don't know

Steps to Reproduce:
1. Don't know
2.
3.

Actual results:
The VM is removed, but the job is stuck in 'started' under /api/jobs.

Expected results:


Additional info:
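For reference, the stuck job can be seen by querying the jobs collection directly (a minimal sketch; ENGINE_FQDN and PASSWORD are placeholders):

# ENGINE_FQDN and PASSWORD are placeholders for the actual engine host and admin password
curl -k -u admin@internal:PASSWORD -H 'Accept: application/xml' https://ENGINE_FQDN/ovirt-engine/api/jobs
# a stuck job keeps reporting <status>started</status> long after the VM is gone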

Comment 1 Raz Tamir 2017-09-26 11:45:39 UTC
Created attachment 1331011 [details]
engine log

Comment 2 Michael Burman 2017-09-27 08:47:02 UTC
Another job isn't finished and is stuck in the engine DB.
The image is locked forever and can't be released.
The Preview VM Snapshot operation failed and the engine is aware of it, but the job keeps running forever. There is no timeout.

<jobs>
<job href="/ovirt-engine/api/jobs/19f3e594-a84c-406e-bd0f-5bf681179fc1" id="19f3e594-a84c-406e-bd0f-5bf681179fc1">
<actions>
<link href="/ovirt-engine/api/jobs/19f3e594-a84c-406e-bd0f-5bf681179fc1/clear" rel="clear"/>
<link href="/ovirt-engine/api/jobs/19f3e594-a84c-406e-bd0f-5bf681179fc1/end" rel="end"/>
</actions>
<description>Preview VM Snapshot snap4 of VM VM6</description>
<link href="/ovirt-engine/api/jobs/19f3e594-a84c-406e-bd0f-5bf681179fc1/steps" rel="steps"/>
<auto_cleared>true</auto_cleared>
<external>false</external>
<last_updated>2017-09-27T10:52:18.730+03:00</last_updated>
<start_time>2017-09-27T10:52:16.205+03:00</start_time>
<status>started</status>
<owner href="/ovirt-engine/api/users/586c19dc-00b9-00fa-0364-00000000012f" id="586c19dc-00b9-00fa-0364-00000000012f"/>
</job>

- Snapshot-Preview snap4 for VM VM6 was initiated by admin@internal-authz.
- No MAC addresses left in the MAC Address Pool.
- Some MAC addresses had to be reallocated, but operation failed because of insufficient amount of free MACs.
- Failed to complete Snapshot-Preview snap4 for VM VM6.

- In the case of a snapshot preview where a new MAC must be allocated for the VM and no MACs are left, the operation must fail. It does indeed fail and the engine is aware of the failure, but the job runs forever and is stuck in the Finalize state.
The worst thing here is that the image gets locked forever!
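As a stop-gap, the 'clear'/'end' actions listed in the job XML above can presumably be invoked to remove the stuck job from the jobs list (a sketch only; it does not address the locked image, and the exact action body may differ - see the REST API docs):

# placeholders: ENGINE_FQDN, PASSWORD; the empty <action/> body is an assumption
curl -k -u admin@internal:PASSWORD -X POST -H 'Content-Type: application/xml' -d '<action/>' https://ENGINE_FQDN/ovirt-engine/api/jobs/19f3e594-a84c-406e-bd0f-5bf681179fc1/clear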

- Steps to reproduce -
1) Master 4.2
2) 2 VMs in a cluster
3) MAC pool range with only one MAC address in the cluster
4) Start VM1 with 1 vNIC with MAC 'z'
5) Create a snapshot from VM1
6) Unplug the vNIC from VM1, give MAC 'w' (not from the pool) to VM1, and assign MAC address 'z' (the original MAC) to VM2
7) Try to preview VM1 from the snapshot
Expected result - the preview should fail
Actual result - the engine failed the operation:
- Snapshot-Preview snap4 for VM VM6 was initiated by admin@internal-authz.
- No MAC addresses left in the MAC Address Pool.
- Some MAC addresses had to be reallocated, but operation failed because of insufficient amount of free MACs.
- Failed to complete Snapshot-Preview snap4 for VM VM6.
But the job is stuck and the image stays locked!

Let me know if this is a new bug or the same one; maybe we have some general bug with stuck jobs in the engine.

Comment 3 Michael Burman 2017-09-27 08:49:22 UTC
Created attachment 1331344 [details]
engine log for snapshot preview failure

Comment 4 Raz Tamir 2017-09-27 12:59:03 UTC
Raising severity as this affects more and more flows - updates will be added shortly.

Comment 5 Mor 2017-09-27 13:03:34 UTC
Hi, I am also experiencing this issue on my env: 4.2.0-0.0.master.20170926175518.git0d20200.el7.centos
Got two tasks stuck on 'Executing' and 'Finalizing':
'Restoring VM Snapshot Active VM before the preview of VM mac_pool_vm_1'
'Preview VM Snapshot MAC_pool_snapshot_c9 of VM mac_pool_vm_0'

Comment 6 Mor 2017-09-27 13:04:32 UTC
Created attachment 1331434 [details]
engine.log

Comment 7 Michael Burman 2017-09-28 12:51:24 UTC
Hi,
Is anyone going to take a look at this issue? It's becoming very serious: more and more jobs just get stuck, and manual intervention is needed to kill the jobs in the DB.
Another job that got stuck: add host. The job is stuck without any timeout or failure and I need to kill it in the DB.
Every job that gets stuck leaves the system in a very bad shape, and other operations are blocked because of it.

Comment 8 Michael Burman 2017-09-28 12:52:27 UTC
job_id                | 0061745a-a35e-4e6b-b717-afce428b7104
action_type           | AddVds
description           | Adding new Host orchid-vds2.qa.lab.tlv.redhat.com to Cluster Cluster_vlan_162
status                | STARTED
owner_id              | 586c19dc-00b9-00fa-0364-00000000012f
visible               | t
start_time            | 2017-09-28 15:39:27.604+03
end_time              | 
last_update_time      | 2017-09-28 15:39:29.432+03
correlation_id        | 063bdd33-709b-4b60-a828-954068bff5a7
is_external           | f
is_auto_cleared       | t
engine_session_seq_id | 7967


The time is now 15:52:30 and the job is still in the installing state.
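For reference, that row comes from the engine's job table; all currently stuck jobs can be listed with a query along these lines (a sketch, run on the engine host):

# lists every job still marked as STARTED
sudo -u postgres psql -d engine -c "SELECT job_id, action_type, status, start_time FROM job WHERE status = 'STARTED';"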

Comment 9 Piotr Kliczewski 2017-09-29 07:28:22 UTC
In both engine logs I see the following:

2017-09-25 01:53:44,935+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-409) [] [within thread]: endAction for action type RemoveVm threw an exception.: javax.ejb.ConcurrentAccessTimeoutException: WFLYEJB0241: EJB 3.1 PFD2 4.8.5.5.1 concurrent access timeout on CoCoAsyncTaskHelper - could not obtain lock within 5000MILLISECONDS

2017-09-27 15:35:18,799+03 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (default task-22) [vms_syncAction_9c198944-2a6e-4aaa] Exception: javax.ejb.ConcurrentAccessTimeoutException: WFLYEJB0241: EJB 3.1 PFD2 4.8.5.5.1 concurrent access timeout on CoCoAsyncTaskHelper - could not obtain lock within 5000MILLISECONDS

Checking whether there is a known bug for this.
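A quick way to check whether an environment is hitting this failure mode is to grep the engine log for the access-timeout errors quoted above (a sketch; assumes the default log location):

# count the EJB access-timeout errors and show the most recent ones
grep -c 'ConcurrentAccessTimeoutException' /var/log/ovirt-engine/engine.log
grep 'could not obtain lock within' /var/log/ovirt-engine/engine.log | tail -n 5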

Comment 10 Michael Burman 2017-10-01 07:07:38 UTC
I see this same bug on rhevm-4.1.7.2-0.1.el7.noarch

[root@mburman-4-downstream ~]# rpm -q rhevm
rhevm-4.1.7.2-0.1.el7.noarch

[root@mburman-4-downstream ~]# sudo -u postgres psql
could not change directory to "/root"
psql (9.2.23)
Type "help" for help.

postgres=# \c engine
You are now connected to database "engine" as user "postgres".
engine=# \x
Expanded display is on.
engine=# select * from job;
-[ RECORD 1 ]---------+-------------------------------------
job_id                | dbdf2642-8770-48cc-8216-007bd5a5040f
action_type           | TryBackToAllSnapshotsOfVm
description           | Preview VM Snapshot snap3 of VM VM1
status                | STARTED
owner_id              | 585fd038-03b6-012b-002f-000000000300
visible               | t
start_time            | 2017-10-01 09:47:43.107+03
end_time              | 
last_update_time      | 2017-10-01 10:03:22.102+03
correlation_id        | 1f15395e-94ea-484b-a1fd-c7a536f42b0d
is_external           | f
is_auto_cleared       | t
engine_session_seq_id | 3579

engine=#

The job is stuck and had to be removed manually from the DB.
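For reference, such a stuck row can presumably be closed directly in the job table (a sketch only; 'FINISHED' as the terminal status value is an assumption, and the engine DB should be backed up first):

# 'FINISHED' is assumed to be a valid terminal status; back up the DB before running this
sudo -u postgres psql -d engine -c "UPDATE job SET status = 'FINISHED', end_time = now() WHERE job_id = 'dbdf2642-8770-48cc-8216-007bd5a5040f';"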

Comment 11 Michael Burman 2017-10-01 07:08:37 UTC
Created attachment 1332877 [details]
engine log 4.1.7-2

Comment 12 Raz Tamir 2017-10-01 07:31:03 UTC
A Remove Disk operation's task failed to be removed from the DB - ovirt-engine-4.2.0-0.0.master.20170925172449.git9d850a9.el7.centos.noarch:

2017-09-30 14:28:25,539+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-80) [66a7c65] BaseAsyncTask::removeTaskFromDB: Removing task 'c5a638ce-081b-40be-8a9a-6266681af815' from DataBase threw an exception: WFLYEJB0241: EJB 3.1 PFD2 4.8.5.5.1 concurrent access timeout on CoCoAsyncTaskHelper - could not obtain lock within 5000MILLISECONDS


and in DB:
engine=# select * from async_tasks;
-[ RECORD 1 ]---+-------------------------------------
task_id         | 76bc1207-c7af-4219-8344-f82f0c5b11b7
action_type     | 230
status          | 2
result          | 0
step_id         | 
command_id      | b112ed8c-4c2a-4a3f-a379-87aafcdf114e
started_at      | 2017-09-30 14:28:15.883+03
storage_pool_id | 3299475f-cba5-4f59-8042-c14b76bf6560
task_type       | 5
vdsm_task_id    | c5a638ce-081b-40be-8a9a-6266681af815
root_command_id | 93f52f27-c55d-403f-8f1a-4aca94dbc952
user_id         | 59cf4d46-0155-0389-0320-0000000002f4

Not cleared yet (after 24 hours).
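The leftover VDSM task can be spotted the same way (a sketch, using the vdsm_task_id shown above):

# shows the async_tasks row that is still hanging around for this VDSM task
sudo -u postgres psql -d engine -c "SELECT task_id, status, started_at FROM async_tasks WHERE vdsm_task_id = 'c5a638ce-081b-40be-8a9a-6266681af815';"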

Comment 13 Michael Burman 2017-10-02 07:50:43 UTC
This bug should be cloned to 4.1.z; can someone please do that?

Comment 14 Oved Ourfali 2017-10-02 07:52:06 UTC
(In reply to Michael Burman from comment #13)
> This bug should be cloned to 4.1.z; can someone please do that?

Once fixed we will clone it.
Currently the fix isn't sufficient.

Comment 15 Allon Mureinik 2017-10-03 20:18:27 UTC
*** Bug 1497958 has been marked as a duplicate of this bug. ***

Comment 16 Ravi Nori 2017-10-04 16:38:02 UTC
I have posted a patch to fix the javax.ejb.ConcurrentAccessTimeoutException. The issue is in 4.2 only.

The problem reported in comment #2 is related to TryBackToAllSnapshotsOfVmCommand, which is throwing an EngineException (MAC_POOL_NO_MACS_LEFT) in endSuccessfully. Created a new BZ 1498580 to track that issue, as it occurs in both 4.1 and 4.2.

Comment 17 Martin Perina 2017-10-05 13:38:43 UTC
(In reply to Michael Burman from comment #7)
> Hi,
> Is anyone going to take a look at this issue? It's becoming very serious:
> more and more jobs just get stuck, and manual intervention is needed to kill
> the jobs in the DB.
> Another job that got stuck: add host. The job is stuck without any timeout
> or failure and I need to kill it in the DB.
> Every job that gets stuck leaves the system in a very bad shape, and other
> operations are blocked because of it.

This is a different issue and we have created BZ 1498580 to track it, because it also exists in 4.1 and most probably in 4.0 as well.

This bug covers only the injection issue mentioned in Comment 9, which is a 4.2-only issue.

Comment 18 Raz Tamir 2017-10-24 16:18:33 UTC
Michael,

Do you still see this issue?

Comment 19 Michael Burman 2017-10-25 06:10:36 UTC
(In reply to Raz Tamir from comment #18)
> Michael,
> 
> Do you still see this issue?

Hi Raz,

We have a new issue with preview snapshots, which is somewhat affecting the test.
See BZ 1506092.

I will run my use case to make sure this was fixed; in any case, my specific scenario is tracked in BZ 1498580.

Thanks,

Comment 20 Michael Burman 2017-10-25 06:27:09 UTC
Raz, I think that BZ 1506092 should be a blocker for this bug. This one can't be tested properly until that one is fixed.

Comment 21 Michael Burman 2017-10-26 13:23:33 UTC
Hi Raz, 
An update from my side: now that BZ 1506092 is merged to master and fixed, I have tested my use case scenario - snapshot preview failure leaves jobs running and the image locked - and it PASSES.
The preview failed as expected and the job was released as it should be.
I think some other jobs should be tested as well, to make sure this report has been fixed properly.

Comment 22 Raz Tamir 2017-10-30 16:46:09 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171027213842.gitded437c.el7.centos

This issue has not appeared in our automation tests for 2 weeks now.

Comment 23 Sandro Bonazzola 2017-12-20 11:38:01 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

