Description of problem: After 10-15 days, the engine becomes so slow the GUI is unusable. It also fails to coordinate vds operations in a timely manner. I understand the slowness is due to DB deadlocks while updating disk_image_dynamic, but I might be wrong. This was happening in earlier 3.5 release. Due to BZ1198674 and BZ1233194 (Z-Stream) we recommended upgrade to 3.5.8, but both the slowness and deadlock continues. This BZ related the following patch https://gerrit.ovirt.org/#/c/42403/ to the slowness of the engine. I haven't recommended an upgrade to 3.6.6 because I could not find any related BZ for this in 3.6. So I'm filling this under 3.5. Version-Release number of selected component (if applicable): 3.5.8 How reproducible: 0% Actual results: Engine slows down heavily within days. Needs to be restarted. 2016-05-18 05:58:25,591 ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (DefaultQuartzScheduler_Worker-81) [74964c13] Can't execute batch: : java.sql.BatchUpdateException: Batch entry 0 select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ('1271f586-3853-4688-ace5-cdd2fd0bba33' AS uuid),CAST ('b6e9ddaf-32eb-4a52-91a8-a6129dfbbda7' AS uuid),CAST ('0' AS int4),CAST ('0' AS int4),CAST ('42949672960' AS int8),CAST ('0.0' AS numeric),CAST ('0.0' AS numeric),CAST ('0.0' AS numeric)) as result was aborted. Call getNextException to see the cause. 2016-05-18 05:58:25,593 ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (DefaultQuartzScheduler_Worker-81) [74964c13] Can't execute batch. Next exception is: : org.postgresql.util.PSQLException: ERROR: deadlock detected 2016-05-18 05:58:25,604 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) [74964c13] ResourceManager::refreshVdsRunTimeInfo: Error: DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected 2016-05-18 05:58:25,605 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) [74964c13] DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement: org.springframework.dao.DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected ERROR: deadlock detected DETAIL: Process 6675 waits for ShareLock on transaction 1005481826; blocked by process 6665. Process 6665 waits for ShareLock on transaction 1005481825; blocked by process 6675. Process 6675: select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result Process 6665: select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result HINT: See server log for query details. CONTEXT: SQL statement "UPDATE disk_image_dynamic SET read_rate = $1 ,write_rate = $2 ,actual_size = $3 ,read_latency_seconds = $4 ,write_latency_seconds = $5 ,flush_latency_seconds = $6 , _update_date = LOCALTIMESTAMP WHERE image_id in (SELECT distinct image_guid FROM images WHERE image_group_id = $7 and active = true) AND EXISTS (SELECT 1 FROM vm_device vmd WHERE vmd.vm_id = $8 AND vmd.device_id = $7 AND vmd.snapshot_id is NULL)" PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement STATEMENT: select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result Expected results: No deadlocks
Liron, could you take a look please?
Hi Germano, can you please attach a relevant engine log from the time you encountered the issue? regardless, i'm working on a fix. thanks, Liron
Hi Liron, Yes, of course! Attached a bunch of stuff and forgot the one of the most important ones, sorry. I'm attaching it now. And thank you for working on this. Cheers, Germano
Yaniv - this isn't a code change - it's a noticeable issue with a real user impact.
created 10 7.1 rhel workstations VMs on 3.6.8 engine with 1 host holding the 10 vms created 10 shareable disk attached to all of the 10 vms after 17 H, runtime i still dont see any ERROR in the engine log also tried: working with one of the VMs from the engine web admin-respond time seems reasonable and console to the VM also responded and was able to work with it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1743.html