Bug 1342317 - [SCALE] DB deadlock in updatedisk_image_dynamic_by_disk_id_and_vm_id, slowing down engine.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.1
Target Release: ---
Assignee: Liron Aravot
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On:
Blocks: 1350767
 
Reported: 2016-06-03 00:39 UTC by Germano Veit Michel
Modified: 2019-11-14 08:19 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1350767 (view as bug list)
Environment:
Last Closed: 2016-08-23 20:41:20 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Article) 2436171 0 None None None 2016-07-12 01:53:54 UTC
Red Hat Product Errata RHEA-2016:1743 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 21:54:01 UTC
oVirt gerrit 58705 0 'None' MERGED core: db deadlock by disk image dynamic updates 2020-05-27 07:01:51 UTC
oVirt gerrit 58882 0 'None' MERGED core: DiskImageDynamicDao - deadlock caused by wrong sorting 2020-05-27 07:01:51 UTC
oVirt gerrit 58902 0 'None' MERGED core: DiskImageDynamicDao - deadlock caused by wrong sorting 2020-05-27 07:01:52 UTC
oVirt gerrit 58915 0 'None' MERGED core: DiskImageDynamicDao - deadlock caused by wrong sorting 2020-05-27 07:01:52 UTC

Description Germano Veit Michel 2016-06-03 00:39:10 UTC
Description of problem:

After 10-15 days, the engine becomes so slow the GUI is unusable. It also fails to coordinate vds operations in a timely manner. I understand the slowness is due to DB deadlocks while updating disk_image_dynamic, but I might be wrong.

This was happening in earlier 3.5 releases. Due to BZ1198674 and BZ1233194 (Z-Stream) we recommended an upgrade to 3.5.8, but both the slowness and the deadlocks continue. That BZ related the following patch, https://gerrit.ovirt.org/#/c/42403/, to the slowness of the engine.

I haven't recommended an upgrade to 3.6.6 because I could not find any related BZ for this in 3.6, so I'm filing this under 3.5.
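
For illustration only (not the actual engine DAO code): the deadlock shown in the logs below is the classic lock-ordering cycle, where two transactions batch-update the same disk_image_dynamic rows in opposite order and each ends up waiting on a row lock the other already holds. A minimal, hypothetical JDBC sketch of that pattern, assuming a local "engine" database and a simplified UPDATE:

// Hypothetical sketch only; connection URL, credentials and the simplified
// UPDATE are assumptions, not the RHEV-M schema or the ovirt-engine batch code.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DeadlockSketch {
    private static final String URL = "jdbc:postgresql://localhost/engine"; // assumed
    private static final String SQL =
        "UPDATE disk_image_dynamic SET actual_size = actual_size WHERE image_id = CAST(? AS uuid)";

    // Each worker updates the same two rows in one transaction, in the order given.
    static void runBatch(String[] imageIds) throws Exception {
        try (Connection c = DriverManager.getConnection(URL, "engine", "engine")) {
            c.setAutoCommit(false);                    // both updates share one transaction
            try (PreparedStatement ps = c.prepareStatement(SQL)) {
                for (String id : imageIds) {
                    ps.setString(1, id);
                    ps.addBatch();
                }
                ps.executeBatch();                     // row locks are acquired in batch order
            }
            c.commit();                                // locks are held until here
        }
    }

    public static void main(String[] args) throws Exception {
        String diskA = "11111111-1111-1111-1111-111111111111"; // hypothetical image ids
        String diskB = "22222222-2222-2222-2222-222222222222";
        // Worker 1 locks diskA then diskB; worker 2 locks diskB then diskA.
        // If the batches interleave, PostgreSQL aborts one of them with
        // "ERROR: deadlock detected", as in the log below.
        Thread t1 = new Thread(() -> { try { runBatch(new String[]{diskA, diskB}); } catch (Exception e) { e.printStackTrace(); } });
        Thread t2 = new Thread(() -> { try { runBatch(new String[]{diskB, diskA}); } catch (Exception e) { e.printStackTrace(); } });
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}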

Version-Release number of selected component (if applicable):
3.5.8

How reproducible:
0%

Actual results:
The engine slows down heavily within days and needs to be restarted.

2016-05-18 05:58:25,591 ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (DefaultQuartzScheduler_Worker-81) [74964c13] Can't execute batch: : java.sql.BatchUpdateException: Batch entry 0 select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ('1271f586-3853-4688-ace5-cdd2fd0bba33' AS uuid),CAST ('b6e9ddaf-32eb-4a52-91a8-a6129dfbbda7' AS uuid),CAST ('0' AS int4),CAST ('0' AS int4),CAST ('42949672960' AS int8),CAST ('0.0' AS numeric),CAST ('0.0' AS numeric),CAST ('0.0' AS numeric)) as result was aborted.  Call getNextException to see the cause.
2016-05-18 05:58:25,593 ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (DefaultQuartzScheduler_Worker-81) [74964c13] Can't execute batch. Next exception is: : org.postgresql.util.PSQLException: ERROR: deadlock detected
2016-05-18 05:58:25,604 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) [74964c13] ResourceManager::refreshVdsRunTimeInfo: Error: DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
2016-05-18 05:58:25,605 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) [74964c13] DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement: org.springframework.dao.DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected

ERROR:  deadlock detected
DETAIL:  Process 6675 waits for ShareLock on transaction 1005481826; blocked by process 6665.
	Process 6665 waits for ShareLock on transaction 1005481825; blocked by process 6675.
	Process 6675: select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result
	Process 6665: select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE disk_image_dynamic SET read_rate =  $1 ,write_rate =  $2 ,actual_size =  $3 ,read_latency_seconds =  $4 ,write_latency_seconds =  $5 ,flush_latency_seconds =  $6 , _update_date = LOCALTIMESTAMP WHERE image_id in (SELECT distinct image_guid FROM images WHERE image_group_id =  $7  and active = true) AND EXISTS (SELECT 1 FROM vm_device vmd WHERE vmd.vm_id =  $8  AND vmd.device_id =  $7  AND vmd.snapshot_id is NULL)"
	PL/pgSQL function "updatedisk_image_dynamic_by_disk_id_and_vm_id" line 2 at SQL statement
STATEMENT:  select * from public.updatedisk_image_dynamic_by_disk_id_and_vm_id(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4),CAST ($4 AS int4),CAST ($5 AS int8),CAST ($6 AS numeric),CAST ($7 AS numeric),CAST ($8 AS numeric)) as result

Expected results:

No deadlocks

Comment 9 Allon Mureinik 2016-06-05 11:31:15 UTC
Liron, could you take a look please?

Comment 10 Liron Aravot 2016-06-06 17:26:11 UTC
Hi Germano,
can you please attach a relevant engine log from the time you encountered the issue?

regardless, i'm working on a fix.

thanks,
Liron
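
The merged gerrit patches in the Links section are titled "DiskImageDynamicDao - deadlock caused by wrong sorting", i.e. the general technique of ordering the batched updates by a stable key so that concurrent transactions lock the disk_image_dynamic rows in the same order. A minimal, hypothetical sketch of that technique follows; the DiskImageDynamicUpdate type and sortForBatch() helper are illustrative stand-ins, not the actual DAO code.

// Hypothetical sketch only; names and types are made up for illustration.
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

public class SortedBatchSketch {

    // Minimal stand-in for one pending disk_image_dynamic update.
    record DiskImageDynamicUpdate(UUID diskId, UUID vmId, long actualSize) {}

    // Returns the updates sorted by disk id (vm id as a tie-breaker) so the
    // caller's batch touches rows in a deterministic order. Two transactions
    // that both follow this order may still block each other briefly, but they
    // can no longer wait on each other in a cycle.
    static List<DiskImageDynamicUpdate> sortForBatch(List<DiskImageDynamicUpdate> updates) {
        return updates.stream()
                .sorted(Comparator.comparing(DiskImageDynamicUpdate::diskId)
                        .thenComparing(DiskImageDynamicUpdate::vmId))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<DiskImageDynamicUpdate> pending = List.of(
                new DiskImageDynamicUpdate(UUID.randomUUID(), UUID.randomUUID(), 42949672960L),
                new DiskImageDynamicUpdate(UUID.randomUUID(), UUID.randomUUID(), 1073741824L));
        // A batch executor would then call updatedisk_image_dynamic_by_disk_id_and_vm_id
        // once per entry, in this deterministic order.
        sortForBatch(pending).forEach(System.out::println);
    }
}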

Comment 11 Germano Veit Michel 2016-06-06 23:46:36 UTC
Hi Liron,

Yes, of course! I attached a bunch of stuff but forgot one of the most important logs, sorry. I'm attaching it now.

And thank you for working on this.

Cheers,
Germano

Comment 13 Allon Mureinik 2016-06-19 14:35:08 UTC
Yaniv - this isn't just a code change; it's a noticeable issue with real user impact.

Comment 17 eberman 2016-07-13 08:38:22 UTC
Created 10 RHEL 7.1 workstation VMs on a 3.6.8 engine, with 1 host holding all 10 VMs.

Created 10 shareable disks attached to all 10 VMs.
After 17 hours of runtime I still don't see any ERROR in the engine log.

Also tried:
working with one of the VMs from the engine web admin - response time seems reasonable, and the console to the VM also responded and I was able to work with it.

Comment 19 errata-xmlrpc 2016-08-23 20:41:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

