Bug 1568075 - [downstream clone - 4.1.11] HotUnPlugDisk triggers SQL deadlock
Summary: [downstream clone - 4.1.11] HotUnPlugDisk triggers SQL deadlock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.10
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ovirt-4.1.11
: ---
Assignee: Daniel Erez
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1564069
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-16 17:37 UTC by RHV bug bot
Modified: 2019-05-16 13:08 UTC (History)
20 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1564069
Environment:
Last Closed: 2018-04-24 15:30:28 UTC
oVirt Team: Storage
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3410751 0 None None None 2018-04-16 17:39:50 UTC
Red Hat Product Errata RHBA-2018:1219 0 None None None 2018-04-24 15:31:17 UTC
oVirt gerrit 90337 0 ovirt-engine-4.1 MERGED core: do not persist VmDevice#bootOrder 2018-04-17 08:58:23 UTC
oVirt gerrit 90338 0 ovirt-engine-4.1 ABANDONED core: always compute boot order on run vm 2018-04-16 17:39:50 UTC

Description RHV bug bot 2018-04-16 17:37:38 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1564069 +++
======================================================================

Created attachment 1417608 [details]
Full stacktrace as logged in engine.log

Description of problem:

A HotUpPlugDisk operation may trigger an SQL deadlock. These are the related messages logged on engine.log (full stacktrace attached to BZ):

2018-04-04 17:17:08,740Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotUnPlugDiskVDSCommand] (defaul
t task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] FINISH, HotUnPlugDiskVDSCommand, log id: 63104284
2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch: Batch entry 0 select * from public.updatevmdeviceb
ootorder(CAST ('87880aaf-d326-4253-8636-026f53e6ce6b' AS uuid),CAST ('1bfe3e00-0b33-41f6-b413-dc64b9eda5b8' AS uuid),CAST (0 AS int4)) as result was aborted.  Call getNextException to see the cause.
2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch. Next exception is: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand' failed: Conn
ectionCallback; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Exception: org.springframework.dao.DeadlockLoserDataAccessException: ConnectionCallba
ck; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:263) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:356) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:58) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:72) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:60) [dal.jar:]
        at org.ovirt.engine.core.dao.VmDeviceDaoImpl.updateBootOrderInBatch(VmDeviceDaoImpl.java:272) [dal.jar:]
        at org.ovirt.engine.core.bll.utils.VmDeviceUtils.updateBootOrder(VmDeviceUtils.java:1205) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand.executeVmCommand(DetachDiskFromVmCommand.java:96) [bll.jar:]
        at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1255) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1395) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2059) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:137) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:105) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1455) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:516) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:498) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451) [bll.jar:]
(...)
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
        at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:714)
        at org.jboss.jca.adapters.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:1180)
        at org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback.doInConnection(BatchProcedureExecutionConnectionCallback.java:62) [dal.jar:]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:349) [spring-jdbc.jar:4.2.4.RELEASE]
        ... 191 more

2018-04-04 17:17:09,788Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Transaction rolled-back for command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand'.
2018-04-04 17:17:09,803Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] EVENT_ID: USER_FAILED_DETACH_DISK_FROM_VM(2,019), Correlation ID: dfd9e427-683c-41bf-afc0-370b12de0581, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Failed to detach Disk hds-monitor2_Disk1 from VM backup-rhv (User: admin@internal-authz).
2018-04-04 17:17:09,804Z ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-42) [] Operation Failed: [Internal Engine Error]


And this is the information logged by PostgreSQL:

ERROR:  deadlock detected
DETAIL:  Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
	Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
	Process 23976: select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result
	Process 24090: select * from public.updatevmdevice(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS varchar),CAST ($4 AS varchar),CAST ($5 AS varchar),CAST ($6 AS int4),CAST ($7 AS text),CAST ($8 AS bool),CAST ($9 AS bool),CAST ($10 AS bool),CAST ($11 AS varchar),CAST ($12 AS text),CAST ($13 AS uuid),CAST ($14 AS varchar),CAST ($15 AS varchar)) as result
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE vm_device
	    SET boot_order = v_boot_order,
	        _update_date = current_timestamp
	    WHERE device_id = v_device_id
	        AND vm_id = v_vm_id"
	PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
STATEMENT:  select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result


Looks like the execution of multiple UpdateVmDeviceBootOrder in a batch, may collide with another UpdateVmDevice running in parallel. 


Version-Release number of selected component (if applicable):

ovirt-engine-4.1.10.3-0.1.el7.noarch


How reproducible:

Quite frequently while running a Commvault VM backup process.

(Originally by Sergio Lopez Pascual)

Comment 4 RHV bug bot 2018-04-16 17:38:00 UTC
The issue should be already resolved in 4.2 as part of:
https://gerrit.ovirt.org/#/c/71346/
https://gerrit.ovirt.org/#/c/71333/

Since the change seems too major for backporting to 4.1, moving to ON_QA for verification.

(Originally by Daniel Erez)

Comment 10 RHV bug bot 2018-04-16 17:38:32 UTC
Has QE reproduced this?

(Originally by Yaniv Kaul)

Comment 11 RHV bug bot 2018-04-16 17:38:39 UTC
I don't see clear steps to reproduce here so I can't tell if we have this tested

(Originally by Elad Ben Aharon)

Comment 19 RHV bug bot 2018-04-16 17:39:25 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops@redhat.comINFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops@redhat.com

(Originally by rhv-bugzilla-bot)

Comment 20 RHV bug bot 2018-04-16 17:39:31 UTC
Execute all disk hotplug automation tests for verification.

(Originally by Elad Ben Aharon)

Comment 24 Raz Tamir 2018-04-23 12:36:05 UTC
acked

Comment 25 Elad 2018-04-23 16:08:20 UTC
Executed storage automation hotplug/unplug disk sanity over iscsi and nfs. Passed
For a reference, these are the tests cases in Polarion that were executed:

TestCase5033
TestCase5034
TestCase5035
TestCase5036
TestCase5041
TestCase6231


Used:
rhevm-4.1.11.2-0.1.el7.noarch
vdsm-4.19.51-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.3.x86_64

Comment 26 Arik 2018-04-23 16:19:27 UTC
(In reply to Elad from comment #25)
> Executed storage automation hotplug/unplug disk sanity over iscsi and nfs.
> Passed
> For a reference, these are the tests cases in Polarion that were executed:
> 
> TestCase5033
> TestCase5034
> TestCase5035
> TestCase5036
> TestCase5041
> TestCase6231
> 
> 
> Used:
> rhevm-4.1.11.2-0.1.el7.noarch
> vdsm-4.19.51-1.el7ev.x86_64
> libvirt-daemon-3.9.0-14.el7_5.3.x86_64

The fix also changes the way the boot sequence is set on run-VM.
The code that has been backported works fine on the 4.2 branch for quite some time, but don't we want to have at least some minimal verification it also works as expected on 4.1?

Comment 29 errata-xmlrpc 2018-04-24 15:30:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1219

Comment 30 Israel Pinto 2018-05-16 13:57:49 UTC
(In reply to Arik from comment #26)
> (In reply to Elad from comment #25)
> > Executed storage automation hotplug/unplug disk sanity over iscsi and nfs.
> > Passed
> > For a reference, these are the tests cases in Polarion that were executed:
> > 
> > TestCase5033
> > TestCase5034
> > TestCase5035
> > TestCase5036
> > TestCase5041
> > TestCase6231
> > 
> > 
> > Used:
> > rhevm-4.1.11.2-0.1.el7.noarch
> > vdsm-4.19.51-1.el7ev.x86_64
> > libvirt-daemon-3.9.0-14.el7_5.3.x86_64
> 
> The fix also changes the way the boot sequence is set on run-VM.
> The code that has been backported works fine on the 4.2 branch for quite
> some time, but don't we want to have at least some minimal verification it
> also works as expected on 4.1?

Verify with run once and start VM all 3 options: HD, DC, PXE - ALL PASS

Comment 31 Franta Kust 2019-05-16 13:08:27 UTC
BZ<2>Jira Resync


Note You need to log in before you can comment on or make changes to this bug.