Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1342317 - [SCALE] DB deadlock in updatedisk_image_dynamic_by_disk_id_and_vm_id, slowing down engine.
[SCALE] DB deadlock in updatedisk_image_dynamic_by_disk_id_and_vm_id, slowing...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.5.7
x86_64 Linux
unspecified Severity high
: ovirt-4.0.1
: ---
Assigned To: Liron Aravot
Eldad Marciano
: ZStream
Depends On:
Blocks: 1350767
  Show dependency treegraph
 
Reported: 2016-06-02 20:39 EDT by Germano Veit Michel
Modified: 2017-03-07 07:37 EST (History)
14 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1350767 (view as bug list)
Environment:
Last Closed: 2016-08-23 16:41:20 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Article) 2436171 None None None 2016-07-11 21:53 EDT
oVirt gerrit 58705 master MERGED core: db deadlock by disk image dynamic updates 2016-06-08 09:16 EDT
oVirt gerrit 58882 master MERGED core: DiskImageDynamicDao - deadlock caused by wrong sorting 2016-06-09 07:43 EDT
oVirt gerrit 58902 ovirt-engine-4.0 MERGED core: DiskImageDynamicDao - deadlock caused by wrong sorting 2016-06-09 09:02 EDT
oVirt gerrit 58915 ovirt-engine-3.6 MERGED core: DiskImageDynamicDao - deadlock caused by wrong sorting 2016-06-09 10:26 EDT
Red Hat Product Errata RHEA-2016:1743 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 17:54:01 EDT

  None (edit)
Description Germano Veit Michel 2016-06-02 20:39:10 EDT
Description of problem:

After 10-15 days, the engine becomes so slow the GUI is unusable. It also fails to coordinate vds operations in a timely manner. I understand the slowness is due to DB deadlocks while updating disk_image_dynamic, but I might be wrong.

This was happening in earlier 3.5 release. Due to BZ1198674 and BZ1233194 (Z-Stream) we recommended upgrade to 3.5.8, but both the slowness and deadlock continues. This BZ related the following patch https://gerrit.ovirt.org/#/c/42403/ to the slowness of the engine.

I haven't recommended an upgrade to 3.6.6 because I could not find any related BZ for this in 3.6. So I'm filling this under 3.5.

Version-Release number of selected component (if applicable):
3.5.8

How reproducible:
0%

Actual results:
Engine slows down heavily within days. Needs to be restarted.

2016-05-18 05:58:25,591 ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (DefaultQuartzScheduler_Worker-81) [74964c13] Can't execute batch: : java.sql.BatchUpdateException: Batch entry 0 select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ('1271f586-3853-4688-ace5-cdd2fd0bba33' AS uuid),CAST ('b6e9ddaf-32eb-4a52-91a8-a6129dfbbda7' AS uuid),CAST ('0' AS int4),CAST ('0' AS int4),CAST ('42949672960' AS int8),CAST ('0.0' AS numeric),CAST ('0.0' AS numeric),CAST ('0.0' AS numeric)) as result was aborted.  Call getNextException to see the cause.
2016-05-18 05:58:25,593 ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (DefaultQuartzScheduler_Worker-81) [74964c13] Can't execute batch. Next exception is: : org.postgresql.util.PSQLException: ERROR: deadlock detected
2016-05-18 05:58:25,604 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) [74964c13] ResourceManager::refreshVdsRunTimeInfo: Error: DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
2016-05-18 05:58:25,605 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) [74964c13] DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement: org.springframework.dao.DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected

ERROR:  deadlock detected
DETAIL:  Process 6675 waits for ShareLock on transaction 1005481826; blocked by process 6665.
	Process 6665 waits for ShareLock on transaction 1005481825; blocked by process 6675.
	Process 6675: select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result
	Process 6665: select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE disk_image_dynamic SET read_rate =  $1 ,write_rate =  $2 ,actual_size =  $3 ,read_latency_seconds =  $4 ,write_latency_seconds =  $5 ,flush_latency_seconds =  $6 , _update_date = LOCALTIMESTAMP WHERE image_id in (SELECT distinct image_guid FROM images WHERE image_group_id =  $7  and active = true) AND EXISTS (SELECT 1 FROM vm_device vmd WHERE vmd.vm_id =  $8  AND vmd.device_id =  $7  AND vmd.snapshot_id is NULL)"
	PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement
STATEMENT:  select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result

Expected results:

No deadlocks
Comment 9 Allon Mureinik 2016-06-05 07:31:15 EDT
Liron, could you take a look please?
Comment 10 Liron Aravot 2016-06-06 13:26:11 EDT
Hi Germano,
can you please attach a relevant engine log from the time you encountered the issue?

regardless, i'm working on a fix.

thanks,
Liron
Comment 11 Germano Veit Michel 2016-06-06 19:46:36 EDT
Hi Liron,

Yes, of course! Attached a bunch of stuff and forgot the one of the most important ones, sorry. I'm attaching it now.

And thank you for working on this.

Cheers,
Germano
Comment 13 Allon Mureinik 2016-06-19 10:35:08 EDT
Yaniv - this isn't a code change - it's a noticeable issue with a real user impact.
Comment 17 eberman 2016-07-13 04:38:22 EDT
created 10 7.1 rhel workstations VMs on 3.6.8 engine with 1 host holding the 10 vms

created 10 shareable disk attached to all of the 10 vms
after 17 H, runtime i still dont see any ERROR in the engine log

also tried:
working with one of the VMs from the engine web admin-respond time seems reasonable and console to the VM also responded and was able to work with it.
Comment 19 errata-xmlrpc 2016-08-23 16:41:20 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

Note You need to log in before you can comment on or make changes to this bug.