Bug 1564069 - HotUnPlugDisk triggers SQL deadlock
Summary: HotUnPlugDisk triggers SQL deadlock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.10
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ovirt-4.2.3
: ---
Assignee: Daniel Erez
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 1568075
TreeView+ depends on / blocked
 
Reported: 2018-04-05 09:35 UTC by Sergio Lopez
Modified: 2019-05-16 13:06 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1568075 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:48:41 UTC
oVirt Team: Storage
Target Upstream Version:


Attachments (Terms of Use)
Full stacktrace as logged in engine.log (28.07 KB, text/plain)
2018-04-05 09:35 UTC, Sergio Lopez
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3410751 0 None None None 2018-04-13 05:12:17 UTC
Red Hat Product Errata RHEA-2018:1488 0 None None None 2018-05-15 17:49:12 UTC
oVirt gerrit 90337 0 ovirt-engine-4.1 MERGED core: do not persist VmDevice#bootOrder 2020-03-04 16:41:05 UTC
oVirt gerrit 90338 0 ovirt-engine-4.1 ABANDONED core: always compute boot order on run vm 2020-03-04 16:41:05 UTC

Description Sergio Lopez 2018-04-05 09:35:52 UTC
Created attachment 1417608 [details]
Full stacktrace as logged in engine.log

Description of problem:

A HotUpPlugDisk operation may trigger an SQL deadlock. These are the related messages logged on engine.log (full stacktrace attached to BZ):

2018-04-04 17:17:08,740Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotUnPlugDiskVDSCommand] (defaul
t task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] FINISH, HotUnPlugDiskVDSCommand, log id: 63104284
2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch: Batch entry 0 select * from public.updatevmdeviceb
ootorder(CAST ('87880aaf-d326-4253-8636-026f53e6ce6b' AS uuid),CAST ('1bfe3e00-0b33-41f6-b413-dc64b9eda5b8' AS uuid),CAST (0 AS int4)) as result was aborted.  Call getNextException to see the cause.
2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch. Next exception is: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand' failed: Conn
ectionCallback; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Exception: org.springframework.dao.DeadlockLoserDataAccessException: ConnectionCallba
ck; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:263) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:356) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:58) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:72) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:60) [dal.jar:]
        at org.ovirt.engine.core.dao.VmDeviceDaoImpl.updateBootOrderInBatch(VmDeviceDaoImpl.java:272) [dal.jar:]
        at org.ovirt.engine.core.bll.utils.VmDeviceUtils.updateBootOrder(VmDeviceUtils.java:1205) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand.executeVmCommand(DetachDiskFromVmCommand.java:96) [bll.jar:]
        at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1255) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1395) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2059) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:137) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:105) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1455) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:516) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:498) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451) [bll.jar:]
(...)
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
        at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:714)
        at org.jboss.jca.adapters.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:1180)
        at org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback.doInConnection(BatchProcedureExecutionConnectionCallback.java:62) [dal.jar:]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:349) [spring-jdbc.jar:4.2.4.RELEASE]
        ... 191 more

2018-04-04 17:17:09,788Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Transaction rolled-back for command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand'.
2018-04-04 17:17:09,803Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] EVENT_ID: USER_FAILED_DETACH_DISK_FROM_VM(2,019), Correlation ID: dfd9e427-683c-41bf-afc0-370b12de0581, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Failed to detach Disk hds-monitor2_Disk1 from VM backup-rhv (User: admin@internal-authz).
2018-04-04 17:17:09,804Z ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-42) [] Operation Failed: [Internal Engine Error]


And this is the information logged by PostgreSQL:

ERROR:  deadlock detected
DETAIL:  Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
	Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
	Process 23976: select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result
	Process 24090: select * from public.updatevmdevice(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS varchar),CAST ($4 AS varchar),CAST ($5 AS varchar),CAST ($6 AS int4),CAST ($7 AS text),CAST ($8 AS bool),CAST ($9 AS bool),CAST ($10 AS bool),CAST ($11 AS varchar),CAST ($12 AS text),CAST ($13 AS uuid),CAST ($14 AS varchar),CAST ($15 AS varchar)) as result
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE vm_device
	    SET boot_order = v_boot_order,
	        _update_date = current_timestamp
	    WHERE device_id = v_device_id
	        AND vm_id = v_vm_id"
	PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
STATEMENT:  select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result


Looks like the execution of multiple UpdateVmDeviceBootOrder in a batch, may collide with another UpdateVmDevice running in parallel. 


Version-Release number of selected component (if applicable):

ovirt-engine-4.1.10.3-0.1.el7.noarch


How reproducible:

Quite frequently while running a Commvault VM backup process.

Comment 3 Daniel Erez 2018-04-09 14:29:53 UTC
The issue should be already resolved in 4.2 as part of:
https://gerrit.ovirt.org/#/c/71346/
https://gerrit.ovirt.org/#/c/71333/

Since the change seems too major for backporting to 4.1, moving to ON_QA for verification.

Comment 9 Yaniv Kaul 2018-04-10 10:06:52 UTC
Has QE reproduced this?

Comment 10 Elad 2018-04-11 09:21:13 UTC
I don't see clear steps to reproduce here so I can't tell if we have this tested

Comment 18 RHV bug bot 2018-04-13 09:53:14 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops@redhat.comINFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops@redhat.com

Comment 19 Elad 2018-04-13 11:56:42 UTC
Execute all disk hotplug automation tests for verification.

Comment 25 Elad 2018-04-29 09:54:00 UTC
Executed storage automation hotplug/unplug disk sanity over iscsi and nfs. Passed
For a reference, these are the tests cases in Polarion that were executed:

TestCase5033
TestCase5034
TestCase5035
TestCase5036
TestCase5041
TestCase6231


Also, as stated in https://bugzilla.redhat.com/show_bug.cgi?id=1568075#c26, I went over the latest 4.2.3-4 automation tier1 which includes VM sanity cases and boot sequence in particular. For example rhevmtests.compute.virt.run_once.run_once_test. All passed 


Used:
rhvm-4.2.3.3-0.1.el7.noarch
vdsm-4.20.27-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.3.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64

Comment 29 errata-xmlrpc 2018-05-15 17:48:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 30 Elad 2018-08-02 08:31:13 UTC
Regression testing of disk hotplug/unplug covers here.

Comment 31 Franta Kust 2019-05-16 13:06:24 UTC
BZ<2>Jira Resync


Note You need to log in before you can comment on or make changes to this bug.